Follow the Data

A data driven blog

Archive for the category “Uncategorized”

App for exploring brain region specific gene expression

(Short version: explore region-specific gene expression in two human brains at

The Allen Institute for Brain Science has done a tremendous amount of work to digitalize and make available information on gene expression at a fine-grained level both in the mouse brain and the human brain. The Allen Brain Atlas contains a lot of useful information on the developing brain in mouse and human, the aging brain, etc. – both via GUIs and an API.

Among other things, the Allen institute has published gene expression data for healthy human brains divided by brain structure, assessed using both microarrays and RNA sequencing. In the RNA-seq case (which I have been looking at for reasons outlined below), two brains have been sectioned into 121 different parts, each representing one of many anatomical structures. This gives “region-specific” expression data which are quite useful for other researchers who want to compare their brain gene expression experiments to publicly available reference data. Note that each of the defined regions will still be a mix of cell types (various kinds of neuron, astrocytes, oligodendrocytes etc.), so we are still looking at a mix of cell types here, although resolved into brain regions. (Update 2016-07-22: The recently released R package ABAEnrichment seems very useful for a more programmatic approach than the one described here to accessing information about brain structure and cell type specific genes in Allen Brain Atlas data!)

As I have been working on a few projects concerning gene expression in the brain in some specific disease states, there has been a need to compare our own data to “control brains” which are not (to our knowledge) affected by any disease. In one of the projects, it has also been of interest to compare gene expression profiles to expression patterns in specific brain regions. As these projects both used RNA sequencing as their method of quantifying gene (or transcript) expression, I decided to take a closer look at the Allen Institute brain RNA-seq data and eventually ended up writing a small interactive app which is currently hosted at (as well as a back-up location available on request if that one doesn’t work.)

Screen Shot 2016-07-23 at 15.03.45

A screenshot of the Allen brain RNA-seq visualization app

The primary functions of the app are the following:

(1) To show lists of the most significantly up-regulated genes in each brain structure (genes that are significantly more expressed in that structure than in others, on average). These lists are shown in the upper left corner, and a drop-down menu below the list marked “Main structure” is used to select the structure of interest. As there are data from two brains, the expression level is shown separately for these in units of TPM (transcripts per million). Apart from the columns showing the TPM for each sampled brain (A and B, respectively), there is a column showing the mean expression of the gene across all brain structures, and across both brains.

(2) To show box plots comparing the distribution of the TPM expression levels in the structure of interest (the one selected in the “Main structure” drop-down menu) with the TP distribution in other structures. This can be done on the level of one of the brains or both. You might wonder why there is a “distribution” of expression values in a structure. The reason is simply that there are many samples (biopsies) from the same structure.

So one simple usage scenario would be to select a structure in the drop-down menu, say “Striatum”, and press the “Show top genes” button. This would render a list of genes topped by PCP4, which has a mean TPM of >4,300 in brain A and >2,000 in brain B, but just ~500 on average in all regions. Now you could select PCP4, copy and paste it into the “gene” textbox and click “Show gene expression across regions.” This should render a (ggplot2) box plot partitioned by brain donor.

There is another slightly less useful functionality:

(3)  The lower part of the screen is occupied by a principal component plot of all of the samples colored by brain structure (whereas the donor’s identity is indicated by the shape of the plotting character.) The reason I say it’s not so useful is that it’s currently hard-coded to show principal components 1 and 2, while I ponder where I should put drop-down menus or similar allowing selection of arbitrary components.

The PCA plot clearly shows that most of the brain structures are similar in their expression profiles, apart from the structures: cerebral cortex, globus pallidus and striatum, which form their own clusters that consist of samples from both donors. In other words, the gene expression profiles for these structures are distinct enough not to get overshadowed by batch or donor effects and other confounders.

I hope that someone will find this little app useful!



Hacking open government data

I spent last weekend with my talented colleagues Robin Andéer and Johan Dahlberg participating in the Hack For Sweden hackathon in Stockholm, where the idea is to find the most clever ways to make use of open data from government agencies. Several government entities were actively supporting and participating in this well-organized though perhaps slightly unfortunately named event (I got a few chuckles from acquaintances when I mentioned my participations.)

Our idea was to use data from Kolada, a database containing more than 2000 KPIs (key performance indicators) for different aspects of life in the 290 Swedish municipalities (think “towns” or “cities”, although the correspondence is not exactly 1-to-1), to get a “birds-eye view” of how similar or different the municipalities/towns are in general. Kolada has an API that allows piecemeal retrieval of these KPIs, so we started by essentially scraping the database (a bulk download option would have been nice!) to get a table of 2,303 times 290 data points, which we then wanted to be able to visualize and explore in an interactive way.

One of the points behind this app is that it is quite hard to wrap your head around the large number of performance indicators, which might be a considerable mental barrier for someone trying to do statistical analysis on Swedish municipalities. We hoped to create a “jumping-board” where you can quickly get a sense on what is distinctive for each municipality and which variables might be of interest, after which a user would be able to go deeper into a certain direction of analysis.

We ended up using the Bokeh library for Python to make a visualization where the user can select municipalities and drill down a little bit to the underlying data, and Robin and Johan cobbled together a web interface (available at  We plotted the municipalities using principal component analysis (PCA) projections after having tried and discarded alternatives like MDS and t-SNE. When the user selects a town in the PCA plot, the web interface displays its most distinctive (i.e. least typical) characteristics. It’s also possible to select two towns and get a list of the KPIs that differ the most between the two towns (based on ranks across all towns). Note that all of the KPIs are named and described in Swedish, which may make the whole thing rather pointless for non-Swedish users.

The code is on GitHub and the current incarnation of the app is at Kommunvis.

Perhaps unsurprisingly, there were lots of cool projects on display at Hack for Sweden. The overall winners were the Ge0Hack3rs team, who built a striking 3D visualization of different parameters for Stockholm (e.g. the density of companies, restaurants etc.) as an aid for urban planners and visitors. A straightforward but useful service which I liked was Cykelranking, built by the Sweco Position team, an index for how well each municipality is doing in terms of providing opportunities for bicycling, including detailed info on bicycle paths and accident-prone locations.

This was the third time that the yearly Hack for Sweden event was held, and I think the organization was top-notch, in large, spacey locations with seemingly infinite supply of coffee, food, and snacks, as well as helpful government agency data specialists in green T-shirts whom you were able to consult with questions. We definitely hope to be back next year with fresh new ideas.

This was more or less a 24-hour hackathon (Saturday morning to Sunday morning), although certainly our team used less time (we all went home to sleep on Saturday evening), yet a lot of the apps built were quite impressive, so I asked some other teams how much they had prepared in advance. All of them claimed not to have prepared anything, but I suspect most teams did like ours did (and for which I am grateful): prepared a little dummy/bare-bones application just to make sure they wouldn’t get stuck in configuration, registering accounts etc. on the competition day. I think it’s a good thing in general to require (as this hackathon did) that the competitors state clearly in advance what they intend to do, and prod them a little bit to prepare in advance so that they can really focus on building functionality on the day(s) of the hackathon instead of fumbling around with installation.



ASCII Autoencoder

Joel and I were playing around with TensorFlow, the deep learning library that Google recently released and that you have no doubt heard of. We had put together a little autoencoder implementation and were trying to get a handle on how well it was working.

An autoencoder can be viewed as a neural network where the final layer, the output layer, is supposed to reconstruct the values that have been fed into the input layer, possibly after some distortion of the inputs (like forcing a fraction of them to be zero, dropout, or adding some random noise). In the case with corrupted, it’s called a denoising autoencoder, and the purpose of adding the noise or dropout is to make the system discover more robust statistical regularities in the input data (there is some good discussion here).

An autoencoder often has fewer nodes in the hidden layer(s) than in the input and is then used to learn a more compact and general representation of the data (the code or encoding). With only one hidden layer and linear activation functions, the encoding should be essentially the same as one gets from PCA (principal component analysis), but non-linear activation functions (e g sigmoid and tanh) will yield different representations, and multi-layer or stacked autoencoders will add a hierarchical aspect.

Some references on autoencoders:

Ballard (1987) – Modular learning in neural networks

Andrew Ng’s lecture notes on sparse autoencoders

Vincent et al (2010) – Stacked denoising autoencoders

Tan et al (2015) – ADAGE analysis of publicly available gene expression data collections illuminates Pseudomonas aeruginosa-host interactions

Anyway, we were trying some different parametrizations of the autoencoder (its training performance can depend quite a lot on how the weights are initialized, the learning rate and the number of hidden nodes) and felt it was a bit boring to just look at a single number (the reconstruction error). We wanted to get a feel for how training is progressing across the input data matrix, so we made the script output for each 1000 rounds of training a colored block of text in the terminal where the background color represents the absolute difference between the target value and the reconstructed value using bins. The “best” bin (bin 0) is dark green and represents that the reconstruction is very close to the original input; the “bad” bins have reddish colors. If the data point has been shifted t0 a new bin in the last 1000 rounds (i e the reconstruction has improved or deteriorated noticeably), a colored digit indicating the new bin is shown in the foreground. (This makes more sense when you actually look at it.) We only show the first 75 training examples and the first 75 features, so if your data set is larger than that you won’t see all of it.

The code is on GitHub. There are command-line switches for controlling the number of hidden nodes, learning rate, and other such things. There are probably many aspects that could be improved but we thought this was a fun way to visualize the progress and see if there are any regions that clearly stand out in some way.

Here are a few screenshots of an example execution of the script.

As the training progresses, the overall picture gets a bit greener (the reconstruction gets closer to the input values) and the reconstructions get a bit more stable (i e not as many values have a digit on them to indicate that the reconstruction has improved or deteriorated). The values under each screenshot indicates the number of training cycles and the mean squared reconstruction error.

Watson hackathon in Uppsala

Today I spent most of the day trying to grok IBM Watson’s APIs during a hackathon (Hackup) in Uppsala, where the aim was to develop useful apps using those APIs. Watson is, of course, famous for being good at Jeopardy and for being at the center for IBM’s push into healthcare analytics, but I hadn’t spent much time before this hackathon checking out exactly what is available to users now in terms of APIs etc. It turned out to be a fun learning experience and I think a good time was had by all.

We used IBM’s Bluemix platform to develop apps. As the available Watson API’s (also including the Alchemy APIs that are now part of Bluemix) are mostly focused on natural language analysis (rather than generic classification and statistical modeling), our team – consisting of me and two other bioinformaticians from Scilifelab – decided to try to build a service for transcribing podcasts (using the Watson Speech To Text API) in order to annotate and tag them using the Alchemy APIs for keyword extraction, entity extraction etc. This, we envisioned, would allow podcast buffs to identify in which episode of their favorite show a certain topic was discussed, for instance. Eventually, after ingesting a large number of podcast episodes, the tagging/annotation might also enable things like podcast recommendations and classification, as podcasts could be compared to each other based on themes and keywords. This type of “thematic mapping” could also be interesting for following a single podcast’s thematic development.

As is often the case, we spent a fair amount of time on some supposedly mundane details. Since the speech-to-text conversion was relatively slow, we tried different strategies to split the audio files and process them in parallel, but could not quite make it work. Still, we ended up with a (Python-based) solution that was indeed able to transcribe and tag podcast episodes, but it’s still missing a front-end interface and a back-end database to hold information about multiple podcast episodes.

There were many other teams who developed cool apps. For instance one team made a little app for voice control of a light switch using a Raspberry Pi, and another team had devised an “AI shopper” that will remind you to buy stuff that you have forgotten to put on your shopping list. One entry was a kind of recommendation system for what education you should pursue, based on comparing a user-submitted text against a model trained on papers from people in different careers, and another one was an app for quantifying the average positive/negative/neutral sentiments found in tweets from different accounts (e.g. NASA had very positive tweets on average whereas BBC News was fairly negative).

All in all, a nice experience, and it was good to take a break from the Stockholm scene and see what’s going on in my old home town. Good job by Jason Dainter and the other organizers!

GitXiv – collaborative open source computer science

Just wanted to highlight GitXiv, an interesting new resource that combines paper pre-print publication, implementation code and a discussion forum in the same space. The About page explains the concept well:

In recent years, a highly interesting pattern has emerged: Computer scientists release new research findings on arXiv and just days later, developers release an open-source implementation on GitHub. This pattern is immensely powerful. One could call it collaborative open computer science (cocs).

GitXiv is a space to share links to open computer science projects. Countless Github and arXiv links are floating around the web. Its hard to keep track of these gems. GitXiv attempts to solve this problem by offering a collaboratively curated feed of projects. Each project is conveniently presented as arXiv + Github + Links + Discussion. Members can submit their findings and let the community rank and discuss it. A regular newsletter makes it easy to stay up-to-date on recent advancements. It´s free and open.

The feed contains a lot of yummy research on things like deep learning, natural language processing and graphs, but GitXiv is not restricted to any particular computer science areas – anything is welcome!

Neural networks hallucinating text

I’ve always been a sucker for algorithmic creativity, so when I saw the machine generated Obama speeches, I immediately wanted to try the same method on other texts. Fortunately, that was easily done by simply cloning the char-rnn repository by Andrej Karpathy, which the Obama-RNN was based on. Andrej has also written a long and really very good introduction to recurrent neural networks if you want to know more about the inner workings of the thing.

I started by downloading an archive of all the posts on this blog and trained a network with default parameters according to the char-rnn instructions. In the training phase, the network tries to learn to predict the next character in the text. Note that it does not (initially) know anything about words or sentences, but learns about those concepts implicitly with training. After training, I let the network hallucinate new blog posts by sampling from the network state (this is also described on the GitHub page). The network can be “primed” with a word or a phrase, and a temperature parameter controls how conservative or daring the network should be when generating new text. Essentially a low temperature will give a repetitive output endlessly rehashing the same concepts (namely, the most probable ones based on the training data) while a high temperature will output more adventurous stuff such as weird new “words” and sometimes imaginary hyperlinks (if links were included in the input).

Here are a few samples from the Follow the Data RNN. You’ll be the judge of how well it captures the blog’s spirit.

Temperature 1:

predictive pullimation; in personal ining the find of R crition Large Liforrsion Sachelity Complents have bio carristed in software predictions on yenogic data is to first general revalumated by people of cholescologion, slides of programs where pluding this onlines it a hagn to be step-nature for genomes with onlowed data data set that from the system that me personito of the being prediction of pacen”</em>.


perhoul what yeverope informatation will likely moveling in langup genetic paring how health reed masses disease out its mames by (I’ve not, one would PDAscoprometain, this proop whothem the astimate all talks be a single wike this blogpors high tho out that company around and brarting provide nest potential cervel gods a sall sult of they are it is now laned to write can be used metagenome have not probable to the imail, the pettoling in an, described

Temperature 0.2:

articles are and self-tracking and company that the command the starts to see that the results of the prediction in the prediction contests about the prediction contests in the contest at the command the data sets of statistics and prediction contests about the presentation of the data set of both say that the results of the presentation is an interesting problems in the concept of the prediction data sets of both samples of data sets of the problem is that is a predictive analysis and some sequences and sense for the talk may be a set of the paper and the sequence profiles are an experience and probably the statistical analysis of analysis statistics and provides and prediction contests in the contests and company that the command the starts of the paper in the prediction probably and better that all the company that the (etc …)

OK, that was not great. Let’s raise the temperature to 0.7 and seed the net with “deep learning”:

maker the runion following how build on our related in atther are algorithms in scientific movement is sequence of the paper in a lot more and results of data posting to see in genetal sequence of discussion in the software completely (with Manalay will be all cantrusualing indelation and know possistration and simplement caper is the lately and the lately of the entroagh the clinical replot on what thoughts in the related network. The currently in the <a href=””>here</a&gt;.

In such as well be a new up some notes stall more results to release which can be used on the process of biolity (or because the problem is to have been every booking the clinical resomm that build of the paper is on the web company relation to me that it up a preteors, machine learning teams. It are blogging index, the exterment). I’ve lidel sentiment methods to call contain variant public.

The interview quarthyre of they asseance in the prediction methods.
A  multiple possible to such fut to this sequence data sets are both the facural statistics of courses where can be complenes and inpreserver the allow predictive results are and more scale some the paper is that the build betted examples are signity.</li>

You’ll note that the net has invented many interesting new words, which is a consequence of the fact that it is generating one character at a time.

For my next experiment, I couldn’t resist putting the collected short stories of JG Ballard (one of my favourite writers) into the network. Here, I used 256 nodes, but perhaps should have trained a bit longer as the results are still a bit shaky (although they have a distinctly Ballardian feel at times). In particular, I am missing words like “canopy”, “perimeter”, and of course the drained swimming pools. Here are some samples:

US DAY, counterlahes to face the films, one programme demolishiking science prison erecting out of the solar aircraft on my mind.

Television made him for itself his pieces of published telescope.

A simple motion of Kennedy heard your legs through the suit within a fortuna from Beach. Angel and London Hinton’s gas. A single tenant of the succession of all the transition limbs ultil to me that three overlooking the space–craft has been insilent.

An assocations of 112–hour long water in front of them all sweak, as if he was saying, his coming at the statue. Its still bays rated like a large physician from the watch–tobe. The ancient birds were simply to embrace her for a long wholly self–conscious police. Nothing two days, children before the cities, Charles Wallari Street calps our overhead, efforts gives up the drums.

Ward shook his head sadly. ‘I don’t felt some due.’

Mongable was like my people seems my fear–spinding beach–car. Yet, but you an overhead time, they’re going to do with the summer seems only in trister held of? I didn’t wasn’t already get to do. If the prayer has questioned much, however, as soon as selfables exhilaration of peaced Franz. Laster had lost impuly as wousen much wave. Perhaps if they meaning on flatper. ‘Let’s go outstands.’ He listened up in the lounge, a acut shifting in and out of the grass and hands. Ryan stood by her mask velvet, neutral on the darkened bedroom. Waist doldent helicopters whisky to stopped the chauffeur on to the water. Simulator was convinced the sculptor of the suffiring moonlight.

At the like the woman stopped and he could pany when it raised the emptines and lined themselves from the stairs, hoods would soon realize the tiled sand–rays of them, like the fourth houses were hidden from the streets, and then spotted in a small features of path Aselage.

At three of the speeding expirsm, Traven returned to me. Once, as I looked at his wife and shields. All afternoons the bel

Or with the max temperature:

Wind’s speaking, Halloway began to climb into the distant. The Sign of the crest of his treasure, like the hidden profile on one of the clocks, rost in the blockhen flourish.

For years before the soared assemble, their fragmented pen with a jewelly confused sector with Bayliss and London on either side with slow he had learned off the sleep. Surprised by the display’s optical relief than there. Accorring you, over–grey manner, the idleg to make on the abandoned gaps of magnine air, welephoring on their rough and owners during myself.

I didn’t have ask pressed to its professional record libbariards and reverse, creatively and given a single–washing suit that had slumped.

As he watched him he left her away from him, presumably slightly old but prismatic apparation of the air, turned back among the beaches.

As we sat down, Slept asleep again, and was sufficioned about in patiently. Taking his fingers, pantic season flew overrounds and satisfied him to lie–accaling with noise. As she sitused, slung at his feet in custom of letters turned on never at falles. Gunfir felt in one hand, he moved away at my carded grass. Her personante stepped off his chest, seemed to have passed. He was in advance with a survoinnes’ crawl. Quickly he regretted to receive while Mrs Osmond – survived the softly steps, pushing abandone in the stone bird. Drill as the houses opposite, whose side–struggle of down–thems alerts prepared to take part of the sea.

Ryker quickly waved him down to his mouth overlooking the raw building, and watched him into a valley behind it. When an interval of hundreds of its impact the Vicionic pores, however, first waiting for my moving palls, for the visual minor curved and shakes test of the town centre. Averything again even he was out of to the building, and by the timewinds like metal landmarsh fingers technique unlimited silk like the few ratts of heels, the unique light from the wheels and speedback on the wound, and continue the s

Using a low temperature (0.2 in this case) we get a lot of stations, statues and chauffeurs:

(the station we have been able to see me to the sound of the station and an essentian of some sort of conscious the sense of a starting concentration of the station and his last few days and the sea and the sea and the sea and set out the station and a series of conscious of the sea. The computers the statue was the special morning the station that had been a million disasted and set off in the corner of the car park. The statue had been transformed into a series of space complex and anti–games in the sun. The first stage of the station was a small car park in the sunlight. The chauffeur was being seen through the shadows of the sky. The bony skin was still standing at the store and started to stand up and down the staircase. He was aware of the chauffeur, and the car park was almost convinced that the station was a small conclusion of the station and a series of experiments in the sense of the sea. The station was almost to himself. He had been a sudden international art of the station that the station was the only way of world was a series of surface. An area of touch with a strange surge of fresh clock and started to stay here to the surrounding bunker. He stood up and stared at her and watched the statue for the first time the statue for the first time and started to stand up and down the stairway to the surface of the stairway. He was about to see me with a single flower, but he was aware of the continuous sight of the statue, and was suffered by the stars of the statue for the first time and the statue for the first time of the sight of the statue in the centre of the car, watching the shore like a demolish of some pathetic material.

That’s it for this time!

Genomics Today and Tomorrow presentation

Below is a Slideshare link/widget to a presentation I gave at the Genomics Today and Tomorrow event in Uppsala a couple of weeks ago (March 19, 2015).

I spoke after Jonathan Bingham of Google Genomics and talked a little bit about how APIs, machine learning, and what I call “querying by dataset” could make life easier for bioinformaticians working on data integration. In particular, I gave examples of a few types of queries that one would like to be able to do against “all public data” (slides 19-24).

Not long after, I saw this preprint (called “Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees”) that seems to provide part of the functionality that I was envisioning – in particular, the ability to query public sequence repositories by content (using a sequence as a query), rather than by annotation (metadata). The beginning of the abstract goes like this:

Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments.

Peak meetup in Stockholm?

This month (March 2015) is simply packed with data related meetups in Stockholm. Are we witnessing the peak of interest or can it be sustained? Nice to see that there seems to be a large enough group of enthusiasts to sustain these meetings.

These are the meetups that I am aware of (and in some cases have been to) in Stockholm this month:

March 10. Kickoff for the new Stockholm Deep Learning group. Already held, there is a recording here.

March 11. Google BigQuery 101 with Jordan Tigani, already held. at Google’ Stockholm office.

March 18. Stockholm Hadoop User Group meeting: Druid at Criteo and History of Hadoop at Spotify.

March 18. Big Data, Stockholm.

March 23. Stockholm R useR Group meeting: R in genomics. (I am one of the organizers)

March 23. Stockholm Machine Learning / Spotify Tech Group: Large-scale Machine Learning with Spark and Flink

March 24. Stockholm Natural Language Processing group meeting: Word Embedding, from theory to practice at Gavagai.

March 26. Apache Spark Show and Tell.

That’s seven eight meetups, without me even really looking. Perhaps there are others too? I hope Stockholm can sustain the momentum (and perhaps that it spreads to other cities in the region.)

Notes on genomics APIs #2: Google Genomics API

This is the second in a series of about three posts with notes on different genomics APIs. The first post, which was about the One Codex API, can be found here.

As you may have heard, Google has started building an ambitious infrastructure for storing and querying genomic data, so I was eager to start exploring it. However, as there were a number of tools available, I initially had some trouble wrapping my head around what I was supposed to do. I hope these notes, where I mainly use the API for R, can provide some help.

Some useful bookmarks:

Google Developers Console – for creating and managing Google Genomics and BigQuery projects.

Google Genomics GitHub repo

Google Cloud Platform Google Genomics page (not sure what to call this page really)

Getting started

You should start by going to the Developer Console and creating a project. You will need to give it a name, and in addition it will be given a unique ID which you can use later in API calls. When the project has been created, click “Enable an API” on the Dashboard page, and click the button where it says “OFF” next to Genomics API (you may need to scroll down to find it).

Now you need to create a client_secret.json file that you will use for some API calls. Click the Credentials link in the left side panel and then click “Create new client ID”. Select “Installed application” and fill in the “Consent screen” form. All you really need to do is select an email address and type a “product name”, like “BlogTutorial” like I did for this particular example. Select “Installed application” again if you are prompted to select an application type. Now it should display some information under the heading “Client ID for native application”. Click the “Download JSON” button and rename the file to client_secret.json. (I got these instructions from here.)

Using the Java API client for exploring the data sets

One of the first questions I had was how to find out which datasets are actually available for querying. Although it is perfectly possible to click around in the Developer Console, I think the most straightforward way currently is to use the Java API client. I installed it from the Google Genomics GitHub repo by cloning:
git clone
The GitHub repo page contains installation instructions, but I will repeat them here. You need to compile it using Maven:

cd api-client-java
mvn package

If everything goes well, you should now be able to use the Java API client to look for datasets. It is convenient (but not necessary) to put the client_secret.json file into the same directory as the Java API client. Let’s check which data sets are available (this will only work for projects where billing has been enabled; you can sign up for a free trial in which case you will not be surprise-billed):

java -jar genomics-tools-client-java-v1beta2.jar listdatasets --project_number 761052378059 --client_secrets_filename client_secret.json

(If your client_secret.json file is in another directory, you need to give the full path to the file, of course.) The project number is shown on your project page in the Developer Console. Now, the client will open a browser window where you need to authenticate. You will only need to do this the first time. Finally, the results are displayed. They currently look like this:

Platinum Genomes (ID: 3049512673186936334)
1000 Genomes - Phase 3 (ID: 4252737135923902652)
1000 Genomes (ID: 10473108253681171589)

So there are three data sets. Now let’s check which reference genomes are available:

java -jar genomics-tools-client-java-v1beta2.jar searchreferencesets --client_secrets_filename ../client_secret.json --fields 'referenceSets(id,assemblyId)'

The output is currently:


To find out the names of the chromosomes/contigs in one of the reference genomes: (by default this will only return the ten first hits, so I specify –count 50)

java -jar genomics-tools-client-java-v1beta2.jar searchreferences --client_secrets_filename client_secret.json  --fields 'references(id,name)' --reference_set_id EMWV_ZfLxrDY-wE --count 50

Now we can try to extract a snippet of sequence from one of the chromosomes. Chromosome 9 in hg19 had the ID EIeX4KDCl634Jw, so the query becomes, if we want to extract some sequence from 13 Mbases into the chromosome:

java -jar genomics-tools-client-java-v1beta2.jar getreferencebases  --client_secrets_filename client_secret.json --reference_id ENywqdu-wbqQBA --start 13000000 --end 13000070


Another thing you might want to do is to check which “read groups” that are available in one of the data sets. For instance, for the Platinum Genomes data set we get:

java -jar genomics-tools-client-java-v1beta2.jar searchreadgroupsets --dataset_id 3049512673186936334  --client_secrets_filename client_secret.json

which outputs a bunch of JSON records that show the corresponding sample name, BAM file, internal IDs, software and version used for alignment to the reference genome, etc.

Using BigQuery to search Google Genomics data sets

Now let’s see how we can call the API from R. The three data sets mentioned above can be queried using Google’s BigQuery interface, which allows SQL-like queries to be run on very large data sets. Start R and install and load some packages:

install.packages("devtools") # unless you already have it!

Now we can access BigQuery through R. Try one of the non-genomics data sets just to get warmed up.

project <- '(YOUR_PROJECT_ID)' # the ID of the project from the Developer Console
sql <- 'SELECT title,contributor_username,comment FROM[publicdata:samples.wikipedia] WHERE title contains "beer" LIMIT 100;'
data <- query_exec(sql, project)

Now the data object should contain a list of Wikipedia articles about beer. If that worked, move on to some genomic queries. In this case, I decided I wanted to look at the SNP for the photic sneeze reflex (the reflex that makes people such as myself sneeze when they go out on a sunny day) that 23andme discovered via their user base. That genetic variant has the ID and is located on chromosome 2, base 146125523 in the hg19 reference genome. It seems that 23andme uses a 1-based coordinate system (the first nucleotide has the index 1) while Google Genomics uses a 0-based system, so we should look for base position 146125522 instead. We query the Platinum Genomes variant table: (you can find the available tables at the BigQuery Browser Tool Page)

sql <- 'SELECT reference_bases,alternate_bases FROM[genomics-public-data:platinum_genomes.variants] WHERE reference_name="chr2" AND start=146125522 GROUP BY reference_bases,alternate_bases;'
query_exec(sql, project)

This shows the following output:

reference_bases alternate_bases
1 C T

This seems to match the description provided by 23andme; the reference allele is C and the most common alternate allele is T. People with CC have slightly higher odds of sneezing in the sun, TT people have slightly lower odds, and people with CT have average odds.

If we query for the variant frequencies (VF) in the 13 Platinum genomes, we get the following results (the fraction represents, as I interpret it, the fraction of sequencing reads that has the “alternate allele”, in this case T):

sql <- 'SELECT call.call_set_name,call.VF FROM[genomics-public-data:platinum_genomes.variants] WHERE reference_name="chr2" AND start=146125522;'
query_exec(sql, project)

The output is as follows:

call_call_set_name call_VF
1 NA12882 0.500
2 NA12877 0.485
3 NA12889 0.356
4 NA12885 1.000
5 NA12883 0.582
6 NA12879 0.434
7 NA12891 1.000
8 NA12888 0.475
9 NA12886 0.434
10 NA12884 0.459
11 NA12893 0.588
12 NA12878 0.444
13 NA12892 0.533

So most people here seem to have a mix of C and T, with two individuals (NA12891 and NA12885) having all T:s, in other words they appear to be homozygous for the T allele, if I am interpreting this correctly.

Using the R API client

Now let’s try to use the R API client. In R, install the client from GitHub, and also ggbio and ggplot2 if you don’t have them already:


First we need to authenticate for this R session:

authenticate(file="/path/to/client_secret.json") # substitute the actual path to your client_secret.json file

The Google Genomics GitHub repo page has some examples on how to use the R API. Let’s follow the Plotting Alignments example.

reads <- getReads(readGroupSetId="CMvnhpKTFhDyy__v0qfPpkw",

This will fetch reads corresponding to the given genomic interval (which turns out to overlap a gene called KL) in the read group set called CMvnhpKTFhDyy__v0qfPpkw. By applying one of the Java API calls shown above and grepping for this string, I found out that this corresponds to a BAM file for a Platinum Genomes sample called NA12893.

We need to turn thereadslist into a GAlignment object:

alignments <- readsToGAlignments(reads)

Now we can plot the read coverage over the region using some ggbio functions.

alignmentPlot <- autoplot(alignments, aes(color=strand,fill=strand))
coveragePlot <- ggplot(as(alignments, 'GRanges')) + stat_coverage(color="gray40", fill="skyblue")
tracks(alignmentPlot, coveragePlot, xlab="Reads overlapping for NA12893")

As in the tutorial, why not also visualize the part of the chromosome where we are looking.

ideogramPlot <- plotIdeogram(genome="hg19", subchr="chr13")
ideogramPlot + xlim(as(alignments, 'GRanges'))


Now you could proceed with one of the other examples, for instance the variant annotation comparison example, which I think is a little bit too elaborate to reproduce here.

e-infrastructures for data management and deep learning for NLP

This week, I spent two days at the Workshop on e-Infrastructures for Massively Parallel Sequencing in Uppsala. It was a nice event with sessions on workflow systems, virtual machines, DNA sequencing data management experiences in various countries etc. Rather than summarizing the takeaways, I’d like to point to a Storify made by Surya Saha which includes a lot of tweets from yours truly. I sometimes live-tweet talks from events but I don’t think I’ve ever gotten so many retweets and new followers before. I guess I underestimated how interested people are on workflow management and pipelining systems.

Tonight, I went to a meetup on deep learning and natural language processing hosted by the Stockholm NLP Meetup group. It was a fun session led by Roelof Pieters which included hands-on programming with Theano and some tantalizing glimpses on what you can do with word embeddings and distributed semantics. A recording can be found here and the slides are here. There’s also a GitHub repo with some iPython notebooks containing useful example code in Theano.

Post Navigation