Follow the Data

A data driven blog

Notes on genomics APIs #3: SolveBio

This is the third in a short series of posts with notes on different genomics APIs. The first post, which was about the One Codex API, can be found here, and the second one, about Google Genomics, can be found here.

SolveBio “delivers the critical reference data used by hospitals and companies to run genomic applications”, according to their web page. They focus on clinical genomics and on helping developers who need programmatic access to various data sources. Their curated data library provides access to (as of February 2015) “over 300 datasets for genomics, proteomics, literature annotation, variant-disease relationships, and more”. Some examples of those datasets are the ClinVar disease gene database from NIH, the Somatic Mutations dataset from The Cancer Genome Atlas, and the COSMIC catalogue of somatic mutations in cancer.

SolveBio offers a RESTful API with Python and Ruby clients already available and an R client under development. The Getting Started Guide really tells you most of what you need to know to use it, but let’s try it out here on this blog anyway!

You should, of course, start by signing up for a free account. After that, it’s time to get the client. I will use the Python one in this post. It can be installed by giving this command:

curl -skL | bash

You can also install it with pip.

Now you will need to login. This will prompt you for your email and password that you registered when signing up.

solvebio login

At this point you can view a useful tutorial by giving solvebio tutorial. The tutorial explains the concept of depositories, which are versioned containers for data sets. For instance (as explained in the docs), there is a ClinVar depository which (as of version 3.1.0) has three datasets: ClinVar, Variants, and Submissions. Each dataset within a depository is designed for a specific use-case. For example, the Variants dataset contains data on genomic variants, and supports multiple genome builds.
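The full dataset names used below encode this hierarchy as depository/version/dataset. As a plain-Python illustration (this is not part of the SolveBio client), such a path splits like this:

```python
def parse_dataset_path(path):
    """Split a 'Depository/version/Dataset' path into its parts.

    The version component is optional, as in 'ClinVar/Variants'.
    """
    parts = path.split("/")
    if len(parts) == 3:
        depository, version, dataset = parts
    elif len(parts) == 2:
        depository, dataset = parts
        version = None  # the client then resolves the latest version
    else:
        raise ValueError("expected 'Depository[/version]/Dataset'")
    return depository, version, dataset

print(parse_dataset_path("ClinVar/3.1.0-2015-01-13/Variants"))
# ('ClinVar', '3.1.0-2015-01-13', 'Variants')
```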

Now start the interactive SolveBio shell. This shell (in case you followed the instructions above) is based on IPython.


The command Depository.all() will show the available depositories.

In a similar way, you can view all the data sets with Dataset.all(). Type Dataset.all(latest=True) to view only the latest additions.

To work with a data set, you need to ‘retrieve’ it with a command like:

ds = Dataset.retrieve('ClinVar/3.1.0-2015-01-13/Variants')

It is perfectly possible to leave out the version of the data set: ds = Dataset.retrieve('ClinVar/Variants') but that is bad practice from a reproducibility viewpoint and is not recommended, especially in production code.

Now we can check which fields are available in the ds object representing the data set we selected, using the ds.fields() command.


There are fields for things like alternate alleles for the variant in question, sources of clinical information on the variant, the name of any gene(s) overlapping the variant, and the genomic coordinates for the variant.

You can create a Python iterator for looping through all the records (variants) using ds.query(). To view the first variant, type ds.query()[0]. This will give you an idea of how each record (variant) is described in this particular data set. In practice, you will almost always want to filter your query according to some specified criteria. So for example, to look for known pathogenic variants in the titin (TTN) gene, you could filter as follows:

ttn_vars = ds.query().filter(clinical_significance='Pathogenic', gene_symbol_hgnc='TTN')

This will give you an iterator with a bunch of records (currently 18) that you can examine in more detail.
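To see what a filter like this does conceptually, here is a plain-Python sketch (not the SolveBio client itself) that applies the same keyword-argument matching to a local list of dict records; the example records are made up:

```python
def filter_records(records, **criteria):
    """Yield records whose fields match all keyword criteria,
    mimicking ds.query().filter(...) on a local list of dicts."""
    for rec in records:
        if all(rec.get(field) == value for field, value in criteria.items()):
            yield rec

# Toy records standing in for ClinVar variant entries
variants = [
    {"gene_symbol_hgnc": "TTN", "clinical_significance": "Pathogenic"},
    {"gene_symbol_hgnc": "TTN", "clinical_significance": "Benign"},
    {"gene_symbol_hgnc": "BRCA1", "clinical_significance": "Pathogenic"},
]

ttn_vars = list(filter_records(variants,
                               clinical_significance="Pathogenic",
                               gene_symbol_hgnc="TTN"))
print(len(ttn_vars))  # 1
```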

If you want to search for variants in some specified genomic region that you have identified as interesting, you can do that too, but it is only possible for some data sets. In this case it turns out that we can do it with this version of the ClinVar variant data set, because it is considered a “genomic” data set, which we can see because the command ds.is_genomic returns True. (Some of the older versions return False here.)

ds.query(genome_build='GRCh37').range('chr3', 22500000, 23000000)

Note that you can specify a genome build in the query, which is very convenient.

Moving on to a different depository and data set, we can search for diabetes-related variants as defined via genome wide association studies with something like the following:

ds = Dataset.retrieve('GWAS/1.0.0-2015-01-13/GWAS')
ds.fields() # Check out which fields are available
ds.query().filter(phenotype='diabetes') # Also works with "Diabetes"
ds.query().filter(journal='science', phenotype='diabetes') # Only look for diabetes GWAS published in Science

Also, giving a command like Dataset.retrieve('GWAS/1.0.0-2015-01-13/GWAS').help() will open up a web page describing the dataset in your browser.

Notes on genomics APIs #2: Google Genomics API

This is the second in a series of about three posts with notes on different genomics APIs. The first post, which was about the One Codex API, can be found here.

As you may have heard, Google has started building an ambitious infrastructure for storing and querying genomic data, so I was eager to start exploring it. However, as there were a number of tools available, I initially had some trouble wrapping my head around what I was supposed to do. I hope these notes, where I mainly use the API for R, can provide some help.

Some useful bookmarks:

Google Developers Console – for creating and managing Google Genomics and BigQuery projects.

Google Genomics GitHub repo

Google Cloud Platform Google Genomics page (not sure what to call this page really)

Getting started

You should start by going to the Developer Console and creating a project. You will need to give it a name, and in addition it will be given a unique ID which you can use later in API calls. When the project has been created, click “Enable an API” on the Dashboard page, and click the button where it says “OFF” next to Genomics API (you may need to scroll down to find it).

Now you need to create a client_secret.json file that you will use for some API calls. Click the Credentials link in the left side panel and then click “Create new client ID”. Select “Installed application” and fill in the “Consent screen” form. All you really need to do is select an email address and type a “product name”; I used “BlogTutorial” for this particular example. Select “Installed application” again if you are prompted to select an application type. Now it should display some information under the heading “Client ID for native application”. Click the “Download JSON” button and rename the file to client_secret.json. (I got these instructions from here.)

Using the Java API client for exploring the data sets

One of the first questions I had was how to find out which datasets are actually available for querying. Although it is perfectly possible to click around in the Developer Console, I think the most straightforward way currently is to use the Java API client. I installed it from the Google Genomics GitHub repo by cloning:
git clone
The GitHub repo page contains installation instructions, but I will repeat them here. You need to compile it using Maven:

cd api-client-java
mvn package

If everything goes well, you should now be able to use the Java API client to look for datasets. It is convenient (but not necessary) to put the client_secret.json file into the same directory as the Java API client. Let’s check which data sets are available (this will only work for projects where billing has been enabled; you can sign up for a free trial in which case you will not be surprise-billed):

java -jar genomics-tools-client-java-v1beta2.jar listdatasets --project_number 761052378059 --client_secrets_filename client_secret.json

(If your client_secret.json file is in another directory, you need to give the full path to the file, of course.) The project number is shown on your project page in the Developer Console. Now, the client will open a browser window where you need to authenticate. You will only need to do this the first time. Finally, the results are displayed. They currently look like this:

Platinum Genomes (ID: 3049512673186936334)
1000 Genomes - Phase 3 (ID: 4252737135923902652)
1000 Genomes (ID: 10473108253681171589)

So there are three data sets. Now let’s check which reference genomes are available:

java -jar genomics-tools-client-java-v1beta2.jar searchreferencesets --client_secrets_filename ../client_secret.json --fields 'referenceSets(id,assemblyId)'

The output is currently:


To find out the names of the chromosomes/contigs in one of the reference genomes (by default this will only return the first ten hits, so I specify --count 50):

java -jar genomics-tools-client-java-v1beta2.jar searchreferences --client_secrets_filename client_secret.json  --fields 'references(id,name)' --reference_set_id EMWV_ZfLxrDY-wE --count 50

Now we can try to extract a snippet of sequence from one of the chromosomes. Chromosome 9 in hg19 had the ID EIeX4KDCl634Jw, so the query becomes, if we want to extract some sequence from 13 Mbases into the chromosome:

java -jar genomics-tools-client-java-v1beta2.jar getreferencebases  --client_secrets_filename client_secret.json --reference_id ENywqdu-wbqQBA --start 13000000 --end 13000070


Another thing you might want to do is to check which “read groups” are available in one of the data sets. For instance, for the Platinum Genomes data set we get:

java -jar genomics-tools-client-java-v1beta2.jar searchreadgroupsets --dataset_id 3049512673186936334  --client_secrets_filename client_secret.json

which outputs a bunch of JSON records that show the corresponding sample name, BAM file, internal IDs, software and version used for alignment to the reference genome, etc.

Using BigQuery to search Google Genomics data sets

Now let’s see how we can call the API from R. The three data sets mentioned above can be queried using Google’s BigQuery interface, which allows SQL-like queries to be run on very large data sets. Start R and install and load some packages:

install.packages("devtools") # unless you already have it!
install.packages("bigrquery") # provides query_exec(); if your CRAN version lacks it, try the GitHub version via devtools
library(bigrquery)

Now we can access BigQuery through R. Try one of the non-genomics data sets just to get warmed up.

project <- '(YOUR_PROJECT_ID)' # the ID of the project from the Developer Console
sql <- 'SELECT title,contributor_username,comment FROM [publicdata:samples.wikipedia] WHERE title contains "beer" LIMIT 100;'
data <- query_exec(sql, project)

Now the data object should contain a list of Wikipedia articles about beer. If that worked, move on to some genomic queries. In this case, I decided I wanted to look at the SNP for the photic sneeze reflex (the reflex that makes people such as myself sneeze when they go out on a sunny day) that 23andme discovered via their user base. That genetic variant has the ID and is located on chromosome 2, base 146125523 in the hg19 reference genome. It seems that 23andme uses a 1-based coordinate system (the first nucleotide has the index 1) while Google Genomics uses a 0-based system, so we should look for base position 146125522 instead. We query the Platinum Genomes variant table: (you can find the available tables at the BigQuery Browser Tool Page)
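The 1-based vs. 0-based conversion is a classic off-by-one trap, so here is a trivial sketch of it in Python:

```python
def one_based_to_zero_based(pos):
    """Convert a 1-based coordinate (first base = 1) to 0-based (first base = 0)."""
    return pos - 1

def zero_based_to_one_based(pos):
    """The inverse conversion, for reporting positions back in 1-based terms."""
    return pos + 1

print(one_based_to_zero_based(146125523))  # 146125522
```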

sql <- 'SELECT reference_bases,alternate_bases FROM [genomics-public-data:platinum_genomes.variants] WHERE reference_name="chr2" AND start=146125522 GROUP BY reference_bases,alternate_bases;'
query_exec(sql, project)

This shows the following output:

reference_bases alternate_bases
1 C T

This seems to match the description provided by 23andme; the reference allele is C and the most common alternate allele is T. People with CC have slightly higher odds of sneezing in the sun, TT people have slightly lower odds, and people with CT have average odds.

If we query for the variant frequencies (VF) in the 13 Platinum genomes, we get the following results (the fraction represents, as I interpret it, the fraction of sequencing reads that has the “alternate allele”, in this case T):

sql <- 'SELECT call.call_set_name,call.VF FROM [genomics-public-data:platinum_genomes.variants] WHERE reference_name="chr2" AND start=146125522;'
query_exec(sql, project)

The output is as follows:

call_call_set_name call_VF
1 NA12882 0.500
2 NA12877 0.485
3 NA12889 0.356
4 NA12885 1.000
5 NA12883 0.582
6 NA12879 0.434
7 NA12891 1.000
8 NA12888 0.475
9 NA12886 0.434
10 NA12884 0.459
11 NA12893 0.588
12 NA12878 0.444
13 NA12892 0.533

So most people here seem to have a mix of C and T, with two individuals (NA12891 and NA12885) having all Ts; in other words, they appear to be homozygous for the T allele, if I am interpreting this correctly.
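My interpretation above amounts to a naive genotype guess from the variant fraction: values near 0.5 suggest a heterozygote (CT) and values near 1.0 a homozygote for the alternate allele (TT). A toy sketch of that heuristic (the thresholds are my own, not anything the Platinum Genomes pipeline uses):

```python
def naive_genotype(vf, het_band=(0.3, 0.7), hom_threshold=0.9):
    """Guess a genotype from the fraction of reads carrying the alternate allele.

    Toy heuristic only; real variant callers model base qualities, read depth, etc.
    """
    if vf >= hom_threshold:
        return "hom_alt"
    if het_band[0] <= vf <= het_band[1]:
        return "het"
    return "ambiguous"

# A few of the VF values from the query output above
calls = {"NA12885": 1.000, "NA12891": 1.000, "NA12878": 0.444}
print({sample: naive_genotype(vf) for sample, vf in calls.items()})
# {'NA12885': 'hom_alt', 'NA12891': 'hom_alt', 'NA12878': 'het'}
```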

Using the R API client

Now let’s try to use the R API client. In R, install the client from GitHub, and also ggbio and ggplot2 if you don’t have them already:


First we need to authenticate for this R session:

authenticate(file="/path/to/client_secret.json") # substitute the actual path to your client_secret.json file

The Google Genomics GitHub repo page has some examples on how to use the R API. Let’s follow the Plotting Alignments example.

reads <- getReads(readGroupSetId="CMvnhpKTFhDyy__v0qfPpkw",
                  chromosome="chr13", start=33628130, end=33628180) # illustrative interval overlapping the KL gene on chr13

This will fetch reads corresponding to the given genomic interval (which turns out to overlap a gene called KL) in the read group set called CMvnhpKTFhDyy__v0qfPpkw. By applying one of the Java API calls shown above and grepping for this string, I found out that this corresponds to a BAM file for a Platinum Genomes sample called NA12893.

We need to turn the reads list into a GAlignments object:

alignments <- readsToGAlignments(reads)

Now we can plot the read coverage over the region using some ggbio functions.

alignmentPlot <- autoplot(alignments, aes(color=strand,fill=strand))
coveragePlot <- ggplot(as(alignments, 'GRanges')) + stat_coverage(color="gray40", fill="skyblue")
tracks(alignmentPlot, coveragePlot, xlab="Reads overlapping for NA12893")

As in the tutorial, why not also visualize the part of the chromosome where we are looking.

ideogramPlot <- plotIdeogram(genome="hg19", subchr="chr13")
ideogramPlot + xlim(as(alignments, 'GRanges'))


Now you could proceed with one of the other examples, for instance the variant annotation comparison example, which I think is a little bit too elaborate to reproduce here.

Notes on genomics APIs #1: One Codex

This is the first in a series of about three posts with notes on different genomics APIs.

One Codex calls itself “a genomic search engine, enabling new and valuable applications in clinical diagnostics, food safety, and biosecurity”. They have built a data platform where you can rapidly (much more quickly than with e.g. BLAST) match your sequences against an indexed reference database containing a large collection of bacterial, viral and fungal genomes. They have a good web interface for doing the search but have also introduced an API. I like to use command-line APIs in order to wrap things into workflows, so I decided to try it. Here are some notes on how you might use it.

This service could be useful when you want to identify contamination or perhaps the presence of some infectious agent in a tissue sample, but the most obvious use case is perhaps for metagenomics (when you have sequenced a mixed population of organisms). Let’s go to the EBI Metagenomics site, which keeps a directory of public metagenomics data sets. Browsing through the list of projects, we see an interesting-looking one: the Artisanal Cheese Metagenome. Let’s download one of the sequence files for that. Click the sample name (“Artisanal cheeses”), then click the Download tab. Now click “Submitted nucleotide reads (ENA website)”. There are two gzipped FASTQ files here – I arbitrarily chose to download the first one [direct link]. This download is 353 MB and took about 10 minutes on my connection. (If you want a lighter download, you could try the 100 day old infant gut metagenome, which is only about 1 MB in size.)

The artisanal cheese metagenome file contains about 2 million sequences. If you wanted to do this analysis properly, you would probably want to run some de novo assembly tool which is good at metagenomics assembly such as IDBA-UD, Megahit, etc on it, but since my aim here is not to do a proper analysis but just show how to use the One Codex API, I will just query One Codex with the raw sequences.

I am going to use the full data set of 2M sequences. However, if you want to select a subset of, let’s say, 10,000 sequences in order to get results a bit faster, you could do it like this:

gzcat Sample2a.fastq.gz | tail +4000000 | head -40000 > cheese_subset.fastq

(Some explanation is in order. In a FASTQ file, each sequence entry consists of four lines. Thus, we want to pick 40,000 lines in order to get 10,000 sequences. The tail +4000000 part of the command makes the selection start 1 million sequences into the file, that is, at 4 million lines. I usually avoid taking the very first sequences when choosing subsets of FASTQ files, because there are often poor sequences there due to edge effects in the sequencer flow cells. So now you would have selected 10,000 sequences from approximately the middle of the file.)
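The same subsetting can be done in Python, which makes the four-lines-per-record structure explicit. A sketch assuming an uncompressed FASTQ file (for the gzipped original, open it with gzip.open instead):

```python
import itertools

def fastq_records(handle):
    """Yield FASTQ records as 4-tuples of lines (header, sequence, plus, qualities)."""
    while True:
        record = tuple(itertools.islice(handle, 4))
        if len(record) < 4:  # end of file (a trailing partial record is dropped)
            return
        yield record

def subset_fastq(in_path, out_path, skip=1_000_000, take=10_000):
    """Write `take` records, starting `skip` records into the file."""
    with open(in_path) as infile, open(out_path, "w") as outfile:
        selected = itertools.islice(fastq_records(infile), skip, skip + take)
        for record in selected:
            outfile.writelines(record)
```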

Now let’s try to use One Codex to see what the artisanal cheese metagenome contains. First, you need to register for a One Codex account, and then you need to apply for an API key (select Request a Key from the left hand sidebar).

You can use the One Codex API via curl, but there is also a convenient Python-based command-line client, which, however, only seems to work with Python 2 so far (a Python 3 version is under development). If you don’t want to use Python 2 (running it alongside your main Python should be easy enough with virtual environments), you’ll have to refer to the API documentation for how to do the curl calls. In these notes, I will use the command-line client. The installation should be as easy as:

pip install onecodex

Now we can try to classify the contents of our sample. In my case, the artisanal cheese metagenome file is called Sample2a.fastq.gz. We can query the One Codex API with gzipped files, so we don’t need to decompress it. First we need to be authenticated (at this point I am just following the tutorial here):

onecodex login

You will be prompted for your API key, which you’ll find under the Settings on the One Codex web site.

You can now list the available commands:

onecodex --help

which should show something like this:

usage: onecodex [-h] [--no-pretty-print] [--no-threads] [--max-threads N]
[--api-key API_KEY] [--version]
{upload,samples,analyses,references,logout,login} ...
One Codex Commands:
upload Upload one or more files to the One Codex platform
samples Retrieve uploaded samples
analyses Retrieve performed analyses
references Describe available Reference databases
logout Delete your API key (saved in ~/.onecodex)
login Add an API key (saved in ~/.onecodex)
One Codex Options:
-h, --help show this help message and exit
--no-pretty-print Do not pretty-print JSON responses
--no-threads Do not use multiple background threads to upload files
--max-threads N Specify a different max # of N upload threads
(defaults to 4)
--api-key API_KEY Manually provide a One Codex Beta API key
--version show program's version number and exit

Upload the sequences to the platform:

onecodex upload Sample2a.fastq.gz

This took me about five minutes – if you are using a small file like the 100-day infant gut metagenome it will be almost instantaneous. If we now give the following command:

onecodex analyses

it will show something similar to the following:

"analysis_status": "Pending",
"id": "6845bd3fa31c4c09",
"reference_id": "f5a3d51131104d7a",
"reference_name": "RefSeq 65 Complete Genomes",
"sample_filename": "Sample2a.fastq.gz",
"sample_id": "d4aff2bdf0db47cd"
"analysis_status": "Pending",
"id": "974c3ef01d254265",
"reference_id": "9a61796162d64790",
"reference_name": "One Codex 28K Database",
"sample_filename": "Sample2a.fastq.gz",
"sample_id": "d4aff2bdf0db47cd"

where the “analysis_status” of “Pending” indicates that the sample is still being processed. There are two entries because the sequences are being matched against two databases: the RefSeq 65 Complete Genomes and the One Codex 28K Database. According to the web site, “The RefSeq 65 Complete Genomes database […] includes 2718 bacterial genomes and 2318 viral genomes” and the “expanded One Codex 28k database includes the RefSeq 65 database as well as 22,710 additional genomes from the NCBI repository, for a total of 23,498 bacterial genomes, 3,995 viral genomes and 364 fungal genomes.”

After waiting for 10-15 minutes or so (due to some very recently added parallelization capabilities it should only take 4-5 minutes now), the “analysis_status” started showing “Success”. Now we can look at the results. Let’s check out the One Codex 28K database results. You just need to call onecodex analyses with the “id” value shown in one of the outputs above.

bmp:OneCodex mikaelhuss1$ onecodex analyses 974c3ef01d254265
"analysis_status": "Success",
"id": "974c3ef01d254265",
"n_reads": 2069638,
"p_mapped": 0.21960000000000002,
"reference_id": "9a61796162d64790",
"reference_name": "One Codex 28K Database",
"sample_filename": "Sample2a.fastq.gz",
"sample_id": "d4aff2bdf0db47cd",
"url": ""

So One Codex managed to assign a likely source organism to about 22% of the sequences. There is a URL to the results page. This URL is by default private to the user who created the analysis, but One Codex has recently added functionality to make results pages public if you want to share them, so I did that: Artisanal Cheese Metagenome Classification. Feel free to click around and explore the taxonomic tree and the other features.
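The 22% figure comes straight from the analysis record: p_mapped is the fraction of reads that were assigned. A small Python sketch of pulling the numbers out of such a JSON record (field names as in the output above; the record here is abbreviated):

```python
import json

analysis_json = """{
  "analysis_status": "Success",
  "n_reads": 2069638,
  "p_mapped": 0.2196,
  "reference_name": "One Codex 28K Database"
}"""

analysis = json.loads(analysis_json)
mapped_reads = round(analysis["n_reads"] * analysis["p_mapped"])
print(f"{analysis['p_mapped']:.1%} of {analysis['n_reads']} reads assigned "
      f"(~{mapped_reads} reads)")
# 22.0% of 2069638 reads assigned (~454493 reads)
```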

You can also retrieve your analysis results as a JSON file:

onecodex analyses 974c3ef01d254265 --table > cheese.json

We see that the most abundantly detected bacterium in this artisanal cheese sample was Streptococcus macedonicus, which makes sense, as it is frequently isolated from fermented dairy products such as cheese.

e-infrastructures for data management and deep learning for NLP

This week, I spent two days at the Workshop on e-Infrastructures for Massively Parallel Sequencing in Uppsala. It was a nice event with sessions on workflow systems, virtual machines, DNA sequencing data management experiences in various countries, etc. Rather than summarizing the takeaways, I’d like to point to a Storify made by Surya Saha which includes a lot of tweets from yours truly. I sometimes live-tweet talks from events, but I don’t think I’ve ever gotten so many retweets and new followers before. I guess I underestimated how interested people are in workflow management and pipelining systems.

Tonight, I went to a meetup on deep learning and natural language processing hosted by the Stockholm NLP Meetup group. It was a fun session led by Roelof Pieters which included hands-on programming with Theano and some tantalizing glimpses of what you can do with word embeddings and distributed semantics. A recording can be found here and the slides are here. There’s also a GitHub repo with some IPython notebooks containing useful example code in Theano.

“The future belongs to those who own the robots”

I wanted to share a very interesting blog post that Wilhelm Landerholm (@landerholm) wrote today on his Swedish blog, because I think it is worth a wider audience. Google Translate does a pretty good job of translating it (as Daniel Nicorici (@daniel2nico) pointed out on Twitter), but there were a few rough spots, so I produced a lightly edited version of it below with permission from the author. So now I give the word to Wilhelm – please enjoy! (For the record, the statement about robots which also occurs in the name of this blog post was written by Mattias Östmar (@mattiasostmar) on Twitter before appearing in the blog post.)

The Future
Posted on January 21, 2015

This wonderful future, and this constant discussion of what’s to come. There are many prophets, and many who say, in response to events, “What did I tell you?”, naturally without mentioning all those times they were wrong. What will the prophet who recommends that you fix your interest rate at 3.78 percent today say in the future?

In my world, you build models to explain future events. It is therefore always a bit funny to brainstorm about the future of data analysis and the ever so popular concept of Big Data, because what is described is often actually the world we live in today. As an example, today I saw a slide that began with the words, “Those who are best at Big Data are those who will be the best at operations”. But that is not the future – it is a description of how it is today.

I’ve built so many models I would have lost count if I hadn’t documented them. Several of them have been the difference between survival or death of a business. These results achieved by myself and my colleagues have also led to my profession being described as a sexy future profession. I hesitate to concur. Today, no one within my area talks about “creating analysis teams” within the company, even if this is what the Swedish companies are currently doing. Those in my world are busy seeking interfaces between my day-to-day reality and customers. The focus is on automation, on AI and Machine Learning. I am one of those who believe that the future does not have room for the Data Scientist of today. The future belongs to those who can control “robots”. And I am not talking about robots like R2D2, but robots like IBM’s Watson.

Today, I build models for large corporations. For companies where the effects of my work can be enormous. But in the future, these solutions will not be restricted to a few large corporations. They will be a part of everything we do. The Internet of Things will generate such volumes of data that analysis must be moved from BIG to MICRO. This will, in turn, mean that the analysis must be automated. This is where we work on the future today. Today I am putting together “self-driving companies” in a Raspberry Pi and a smartphone. In a few years, we will see self-driving cars. However, I am not sure that there will be a market for everything that moves, and that’s why those of us who were early to jump on the Big Data and Internet of Things train are now looking for contacts with those who own demand.

No matter how many people out there are shouting that the company must build its own analysis team, a BIG DATA UNIT – you’re ten years behind. Because the work that I have done for several customers can now be replaced with a computer for 500 SEK [about 60 USD] at Kjell & Co [a common Swedish chain store for electronic peripherals] and a few [predictive?] models. I focus on being the supplier of the models. The future belongs to those who can automate the analysis. For those who own the robots. Along with those who own demand, they will change, to a large extent, how business is done.

Deep learning and genomics: the splicing code [and breast cancer features]

Last summer, I wrote a little bit about potential applications of deep learning to genomics. What I had in mind then was (i) to learn a hierarchy of cell types based on single-cell RNA sequencing data (with gene expression measures in the form of integers or floats as inputs) and (ii) to discover features in metagenomics data (based on short sequence snippets; k-mers). I had some doubts regarding the latter application because I was not sure how much the system could learn from short k-mers. Well, now someone has tried deep learning from DNA sequence features!

Let’s back up a little bit. One of many intriguing questions in biology is exactly how splicing works. A lot is known about the rules controlling it but not everything. A recent article in Science, The human splicing code reveals new insights into the genetic determinants of disease (unfortunately paywalled), used a machine learning approach (ensembles of neural networks) to predict splicing events and the effects of single-base mutations on the same using only DNA sequence information as input. Melissa Gymrek has a good blog post on the paper, so I won’t elaborate too much. Importantly though, in this paper the features are still hand-crafted (there are 1393 sequence based features).

In an extension of this work, the same group used deep learning to actually learn the features from the sequence data. Hannes Bretschneider posted this presentation from NIPS 2014 describing the work, and it is very interesting. They used a convolutional network that was able to discover things like the reading frame (the three-nucleotide periodicity resulting from how amino acids are encoded in protein-coding DNA stretches) and known splicing signals.
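Feeding raw DNA sequence into a network like theirs starts with encoding each base numerically, and one-hot encoding is the standard first step. A minimal sketch of just that input representation (this is only the encoding, not their model):

```python
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element one-hot vectors (A, C, G, T)."""
    encoding = []
    for base in seq.upper():
        vector = [0, 0, 0, 0]
        if base in BASES:          # unknown bases (e.g. N) stay all-zero
            vector[BASES.index(base)] = 1
        encoding.append(vector)
    return encoding

print(one_hot("ACGN"))
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```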

They have also made available a GPU-accelerated deep learning library for DNA sequence data for Python: Hebel. Right now it seems like only feedforward nets are available (not the convolutional nets mentioned in the talk). I am currently trying to install the package on my Mac.

Needless to say, I think this is a very interesting development and I hope to try this approach on some entirely different problem.

Edit 2015-01-06. Well, what do you know! Just found out that my suggestion (i) has been tried as well. At the currently ongoing PSB’15 conference, Jie Tan has presented work using a denoising autoencoder network to learn a representation of breast cancer gene expression data. The learned features were shown to represent things like tumor vs. normal tissue status, estrogen receptor (ER) status and molecular subtypes. I had thought that there wasn’t enough data yet to support this kind of approach (and even told someone who suggested using The Cancer Genome Atlas [TCGA] data as much at a data science workshop last month – this work uses TCGA data as well as data from METABRIC), and the authors remark in the paper that it is surprising that the method seems to work so well. Previously my thinking was that we needed to await the masses of single-cell gene expression data that are going to come out in the coming years.

Data themed podcasts

I’ve started to follow a number of data-themed podcasts:

  • O’Reilly Data Show. This is a newish spin-off from the O’Reilly Radar Podcasts, led by O’Reilly’s chief data scientist, Ben Lorica. The first three episodes have dealt with “the science of moving dots” (analyzing sports videos), Bitcoin analytics, and Kafka.
  • Partially Derivative. This bills itself as a podcast about “data, data science and awesomeness”, and also seems to have a beer side theme (check out the logo, for instance). I have only heard the latest episode (#7), which I liked.
  • The Data Skeptic Podcast. A very good podcast with frequent episodes. Some are “minisodes”, shorter shows where the host usually discusses some specific statistical issue or model with his wife Linhda, and some are longer interviews with people who talk about surprisingly interesting topics such as “data myths”, how to make unbiased dice, how algorithms can help people find love, etc.
  • Data Stories. This one focuses on data visualization and interviews with strong practitioners of the same. Definitely worth a listen.

I also used to be a fan of The Cloud of Data, which does not seem to be updated anymore. (?)  Are there any other good ones?

Edit 2015-01-04. There is a brand new machine learning podcast called Talking Machines, which doesn’t seem to be on iTunes yet. It seems like a must-listen based on the guest list in the first show (Hanna Wallach [of the much-linked Big Data, Machine Learning, and the Social Sciences: Fairness, Accountability, and Transparency talk], Kevin Murphy [who wrote one of the most highly praised textbooks in the area], Max Welling [co-chair of the NIPS conference], and the three deep learning giants Yann LeCun, Yoshua Bengio and Geoff Hinton).

I also found another machine learning themed podcast from Udacity called Linear Digressions.

Data science workshop at Stockholm university

I spent last Thursday and Friday (Dec. 4-5 2014) in Kista just outside of Stockholm at a data science workshop hosted by Professor Henrik Boström at the Department of Computer and Systems Sciences at Stockholm University. It was a nice opportunity to get to know some new data science people from around Sweden (the geographical spread was decent) whom I haven’t encountered at my usual hunting grounds (mainly meetups). This workshop was perhaps more oriented towards academic data scientists (“Most of the people here have published”, as Henrik said), although there were participants from companies such as Scania and SAAB as well.

The meeting was very enjoyable and we heard presentations on topics such as predictive car maintenance, mining electronic health records for unknown drug side effects, anomaly prediction of elderly people’s movement patterns in their apartments, pinpointing dangerous mutations in flu viruses and much more. On the second day, we were divided into groups where we discussed various topics. I participated in two discussion groups. The first one was about computational life sciences, where we discussed the balance between method development and pure applied analysis in biological data science (bioinformatics). We also talked about the somewhat paradoxical situations that occur in high-throughput biological studies, where careful statistical analysis performed on a large dataset is often only considered believable after experimental validation on a much smaller (and clearly more biased) sample. Perhaps other types of validation, such as independent testing against previously published data, would often be more useful. We also argued that performing (good!) analyses on already published data should be valued much higher than it is today, when “novel data” are often required for publication.

The second discussion was about the comprehensibility or interpretability of classifiers. According to one of the participants, Leo Breiman (who invented random forests, among other things) said, when asked about the most pressing problems in data mining: “There are three: interpretability, interpretability, and interpretability!” (I’m paraphrasing.) Domain experts often prefer models that can give clear and convincing reasons for predictions, that have a small number of relevant features, and that (in some cases) emulate human decision making in the way they produce predictions. However, interpretability or comprehensibility is subjective and hard to measure – there isn’t even any good terminology or definition of it (which is exemplified by the fact that I am using a couple of different words for it here!). It’s likely that this problem will become more and more pressing as machine learning tools become more popular and the number of non-expert users increases. This was a very interesting discussion where a lot of ideas were thrown around: cost-sensitive methods where you have a feature acquisition cost (“pay extra” for features), posting models on Mechanical Turk and asking whether they are comprehensible, gamification where people have to use different models to solve a problem, hybrid classifiers like decision trees where some nodes contain a whole neural network, etc.

Other groups discussed, e.g., automated experimentation and experimental design, and the possible establishment of a PhD program in data science.

One idea that came up a few times throughout the meeting, and which I hadn’t heard about before, was conformal prediction. Apparently this was introduced in the book Algorithmic Learning in a Random World (2005). As far as I understood, conformal prediction methods are concerned with quantifying their own accuracy: they provide reliable bounds on how they will generalize to new data. The price you pay for this reliability is that you don’t get a point estimate as your prediction; instead of a predicted class (in classification) you might get more than one possible class, and instead of a single number (in regression) you get an interval. One of the participants likened this to a confidence interval, but a general one which does not assume a specific distribution of the data. Using this framework, you can get p-values for your predictions and perform hypothesis tests.
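To make the regression case concrete, here is a minimal sketch of one member of this family, split-conformal regression, on made-up data (my own illustration, not something from the book or the workshop): fit on one split, calibrate interval widths on a second, and the resulting intervals cover new observations at roughly the chosen rate without distributional assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def split_conformal_interval(model, X_fit, y_fit, X_cal, y_cal, X_new, alpha=0.1):
    """Split-conformal regression: fit on one split, take the (1 - alpha)
    quantile of absolute residuals on a calibration split, and use it as a
    symmetric interval half-width around new predictions."""
    model.fit(X_fit, y_fit)
    residuals = np.abs(y_cal - model.predict(X_cal))
    n = len(residuals)
    # Finite-sample correction: gives >= (1 - alpha) coverage on new data.
    q = np.quantile(residuals, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    preds = model.predict(X_new)
    return preds - q, preds + q

# Synthetic example: coverage on held-out data should land close to 90%.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=3000)
lo, hi = split_conformal_interval(LinearRegression(),
                                  X[:1000], y[:1000],
                                  X[1000:2000], y[1000:2000],
                                  X[2000:], alpha=0.1)
coverage = np.mean((y[2000:] >= lo) & (y[2000:] <= hi))
```

Note that the coverage guarantee holds regardless of whether the underlying model is any good; a bad model simply produces wide intervals.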

There were also some nice ideas about how to embed various types of data as graphs, and how to validate graphs induced from biological experiments – another type of “generalized confidence interval”, where the aim is to find “reliable” subnetworks based on repeated subsampling.

Quick notes

  • I’ve found the Data Skeptic to be a nice podcast about data science and related subjects. For example, the “data myths” episode and the one with Matthew Russell (who wrote Mining the Social Web) are fun.
  • When I was in China last month, the seat pocket in front of me in the cab we took from the Beijing airport had a glossy magazine in it. The first feature article was about big data (大数据) analysis applied to Chinese TV series and movies, Netflix-style. Gotta beat those Korean dramas! One of the hotels we stayed at in Beijing had organized an international conference on big data analytics the day before we arrived. The signs and posters were still there. Anecdotes, not data, but still.
  • November was a good meetup month in Stockholm. The Machine Learning group had another good event at Spotify HQ, with interesting presentations from Watty, both about how to “data bootstrap” a startup when you discover that the existing data you’ve acquired is garbage and you need to start generating your own in a hurry, and about the actual nitty-gritty details of their algorithms (which model and predict energy consumption from different devices in households by deconvoluting a composite signal), and also about embodied cognition and robotics by Jorge Davila-Chacon (slides here). Also, in an effort to revive the Stockholm Big Data group, I co-organized (together with Stefan Avestad from Ericsson) a meetup with Paco Nathan on Spark. The slides for the talk, which was excellent and extremely appreciated by the audience, can be found here. Paco also gave a great workshop the next day on how to actually use Spark. Finally, I’ve joined the organizing committee of SRUG, the Stockholm R useR group, and have started to plan some future meetups there. The next one will be on December 9 and will deal with how Swedish governmental organizations use R.
  • Erik Bernhardsson of Spotify has written a fascinating blog post combining two of my favorite subjects: chess and deep learning. He has trained a 3-layer, 2048-unit-wide network on 100 million games from FICS (the Free Internet Chess Server, where I, incidentally, play quite often). I’ve often thought about why it seems to be so hard to build a chess engine that really learns the game from scratch, using actual machine learning, rather than the rule- and heuristic-based programs that have ruled the roost, and which have been pre-loaded with massive opening libraries and endgame tablebases (giving the optimal move in any position with N pieces or fewer; I think that N is currently 7). It would be much cooler to have a system that just learns implicitly how to play and does not rely on hand-coded knowledge. Well, Erik seems to have achieved that, kind of. The cool thing is that this program does not need to be told explicitly how the pieces move; it can infer it from data. Since the system is using amateur games, it sensibly enough does not care about the outcome of each game (that would be a weak label for learning). I do think that Erik is a bit optimistic when he writes that “Still, even an amateur player probably makes near-optimal moves for most time.” Most people who have analyzed their own games, or online games, with a strong engine know that amateur games are just riddled with blunders. (I remember the old Max Euwe book “Chess master vs chess amateur”, which also demonstrated this convincingly… but I digress.) Still, a very impressive demonstration! I once supervised a master’s thesis where the aim was to teach a neural network to play some specific endgames, and even that was a challenge. As Erik notes in his blog post, his system needs to be tried against a “real” chess engine. It is reported to score around 33% against Sunfish, but that is a fairly weak engine, as I found out by playing it half an hour ago.

Playing with Swedish election data

The Swedish general elections were held on September 14, 2014, and resulted in a messy parliamentary situation in which the party receiving the most votes, the Social Democrats, had a hard time putting together a functioning government. The previous right-wing (by Swedish standards) government led by the Moderates was voted out after eight years in power. The most discussed outcome was that the nationalist Sweden Democrats party surged to 12.9% of the vote, up from about 5% in the 2010 elections.

I read data journalist Jens Finnäs’ interesting blog post “Covering election night with R”. He was covering the elections with live statistical analysis on Twitter. I approached him afterwards and asked for the data sets he had put together on voting results and various characteristics of the voting municipalities, in order to do some simple statistical analysis of voting patterns, and he kindly shared them with me (they are also available on GitHub, possibly in a slightly different form, I haven’t checked).

What I wanted to find out was whether there was a clear urban-rural separation in voting patterns and whether a principal component analysis would reveal a “right-left” axis and a “traditional-cosmopolitan” axis corresponding to a schematic that I have seen a couple of times now. I also wanted to see if a random forest classifier would be able to predict the vote share of the Sweden Democrats, or some other party, in a municipality, based only on municipality descriptors.

There are some caveats in this analysis. For example, the voting outcomes are compositional data: all the vote shares must sum to one (or 100%). That means that neither PCA nor random forests may be fully appropriate out of the box. Wilhelm Landerholm pointed me to an interesting paper about PCA for compositional data. As for the random forests, I suppose I should use some form of multiple-output RF, which could produce one prediction per party, but since this was a quick and dirty analysis, I just did it party by party.
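One standard trick from the compositional-data literature (I can't say whether it is the one used in the paper Wilhelm pointed me to) is the centered log-ratio (CLR) transform, which maps vote shares out of the sum-to-one simplex before ordinary PCA is applied. A sketch with a made-up share matrix:

```python
import numpy as np

def clr(shares, eps=1e-9):
    """Centered log-ratio transform: log of each component minus the log of
    the row's geometric mean. Maps rows that sum to 1 into unconstrained
    coordinates where ordinary PCA is better justified."""
    logx = np.log(shares + eps)  # eps guards against zero vote shares
    return logx - logx.mean(axis=1, keepdims=True)

# Hypothetical vote-share matrix: rows = municipalities, columns = parties.
shares = np.array([[0.30, 0.25, 0.13, 0.32],
                   [0.35, 0.20, 0.10, 0.35],
                   [0.25, 0.30, 0.15, 0.30]])
Z = clr(shares)  # each row of Z now sums to zero
```

Ordinary PCA (e.g. scikit-learn's `PCA`) can then be run on `Z` instead of on the raw shares.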

The analysis is available as a document with embedded code and figures at Rpubs, or on GitHub if you prefer that. You’ll have to go there to see the plots, but some tentative “results” that I took away were:

  • There are two axes where one (PC1) can be interpreted as a Moderate – Social Democrat axis (rather than a strict right vs left axis), and one (PC2) that can indeed be interpreted as a traditionalist – cosmopolitan axis, with the Sweden Democrats at one end, and the Left party (V) (also to some extent the environmental party, MP, the Feminist initiative, FI, and the liberal FP) at the other end.
  • There is indeed a clear difference between urban and rural voters (urban voters are more likely to vote for the Moderates, rural voters for the Social democrats).
  • Votes for the Sweden Democrats are also strongly geographically determined, but here it is more of a gradient along Sweden’s length (the farther north, the fewer votes – on average – for SD).
  • Surprisingly (?), the reported level of crime in a municipality doesn’t seem to affect voting patterns at all.
  • A random forest classifier can predict votes for a party pretty well on unseen data based on municipality descriptors. Not perfectly by any means, but pretty well.
  • The most informative features for predicting SD vote share were latitude, longitude, proportion of motorcycles, and proportion of educated individuals.
  • The most informative feature for predicting Center party votes was the proportion of tractors :) Likely a confounder/proxy for rural-ness.
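The party-by-party random forest setup described above can be sketched roughly as follows; everything here is synthetic, with the descriptor columns named after the ones in the post and an SD share constructed to decline with latitude.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Entirely made-up municipality data; only the column meanings mirror the post.
rng = np.random.default_rng(42)
n = 200
X = np.column_stack([
    rng.uniform(55, 69, n),     # latitude (degrees north)
    rng.uniform(11, 24, n),     # longitude (degrees east)
    rng.uniform(0.0, 0.1, n),   # proportion of motorcycles
    rng.uniform(0.1, 0.5, n),   # proportion with higher education
])
# Synthetic SD vote share that declines with latitude, as the post reports.
y = 0.45 - 0.005 * X[:, 0] + rng.normal(scale=0.01, size=n)

# One regression forest per party, fitted party by party as in the post.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X[:150], y[:150])
r2 = rf.score(X[150:], y[150:])          # R^2 on held-out municipalities
importances = rf.feature_importances_    # latitude should dominate here
```

In the real analysis one would of course fit on the actual municipality descriptors; the feature-importance vector is what identified latitude, longitude, motorcycles and education as the top predictors for SD.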

There are other things that would be cool to look at, such as finding the most “atypical” municipalities based on the RF model. Also, there is some skew in the RF prediction scatter plots that should be examined. I’ll leave it as is for now and perhaps return to it at some point.

Edit 2015-01-03. I read a post at the Swedish Cornucopia blog, which points out that the number of asylum seekers per capita is positively correlated with SD votes (Pearson r ~ 0.32) and negatively correlated (r ~ -0.34) with votes for the Moderates (M). The author thought this was highly significant, but I felt that there were probably more important indicators. I therefore downloaded data on asylum seekers per capita, which had been put together by combining the Migration Board’s statistics on asylum seekers from December 2014 with Statistics Sweden’s population statistics, and introduced this as a new indicator in my models. I pushed the updated version to GitHub. My interpretation based on the PCA and random forest analyses is that the number of asylum seekers per capita is not among the most important indicators for explaining the SD vote share.
