Follow the Data

A data driven blog

Archive for the tag “metagenomics”

Notes on genomics APIs #1: One Codex

This is the first in a series of about three posts with notes on different genomics APIs.

One Codex calls itself “a genomic search engine, enabling new and valuable applications in clinical diagnostics, food safety, and biosecurity”. They have built a data platform where you can rapidly (much more quickly than with e.g. BLAST) match your sequences against an indexed reference database containing a large collection of bacterial, viral and fungal genomes. They have a good web interface for doing the search but have also introduced an API. I like to use command-line APIs in order to wrap things into workflows, so I decided to try it. Here are some notes on how you might use it.

This service could be useful when you want to identify contamination or perhaps the presence of some infectious agent in a tissue sample, but the most obvious use case is perhaps for metagenomics (when you have sequenced a mixed population of organisms). Let’s go to to the EBI Metagenomics site, which keeps a directory of public metagenomics data sets. Browsing through the list of projects, we see an interesting looking one: the Artisanal Cheese Metagenome. Let’s download one of the sequence files for that. Click the sample name (“Artisanal cheeses”), then click the Download tab. Now click “Submitted nucleotide reads (ENA website)”. There are two gzipped FASTQ files here – I arbitrarily choose to download the first one [direct link]. This download is 353 Mb and took about 10 minutes on my connection. (If you want a lighter download, you could try the 100 day old infant gut metagenome which is only about 1 Mb in size.)

The artisanal cheese metagenome file contains about 2 million sequences. If you wanted to do this analysis properly, you would probably want to run some de novo assembly tool which is good at metagenomics assembly such as IDBA-UD, Megahit, etc on it, but since my aim here is not to do a proper analysis but just show how to use the One Codex API, I will just query One Codex with the raw sequences.

I am going to use the full data set of 2M sequences. However, if you want to select a subset of let’s say 10,000 sequences in order to get results a bit faster, you could do like this:

gzcat Sample2a.fastq.gz | tail +4000000 | head -40000 > cheese_subset.fastq

(Some explanation is in order. In a FASTQ file, each sequence entry consists of four lines. Thus, we want to pick 40,000 lines in order to get 10,000 sequences. The tail +4000000 part of the command makes the selection start 1 million sequences into the file, that is, at 4 million lines. I usually avoid taking the very first sequences when choosing subsets of FASTQ files, because there are often poor sequences there due to edge effects in the sequencer flow cells. So now you would have selected 10,000 sequences from approximately the middle of the file.)

Now let’s try to use One Codex to see what the artisanal cheese metagenome contains. First, you need to register for a One Codex account, and then you need to apply for an API key (select Request a Key from the left hand sidebar).

You can use the One Codex API via curl, but there is also a convenient Python-based command-line client, which, however, only seems to work with Python 2 so far (a Python 3 version is under development). If you don’t want to use Python 2 (which should be easy enough using virtual environments), you’ll have to refer to the API documentation for how to do the curl calls. In these notes, I will use the command-line client. The installation should be as easy as:

pip install onecodex

Now we can try to classify the contents of our sample. In my case, the artisanal cheese metagenome file is called Sample2a.fastq.gz. We can query the One Codex API with gzipped files, so we don’t need to decompress it. First we need to be authenticated (at this point I am just following the tutorial here):

onecodex login

You will be prompted for your API key, which you’ll find under the Settings on the One Codex web site.

You can now list the available commands:

onecodex --help

which should show something like this:

usage: onecodex [-h] [--no-pretty-print] [--no-threads] [--max-threads N]
[--api-key API_KEY] [--version]
{upload,samples,analyses,references,logout,login} ...
One Codex Commands:
upload Upload one or more files to the One Codex platform
samples Retrieve uploaded samples
analyses Retrieve performed analyses
references Describe available Reference databses
logout Delete your API key (saved in ~/.onecodex)
login Add an API key (saved in ~/.onecodex)
One Codex Options:
-h, --help show this help message and exit
--no-pretty-print Do not pretty-print JSON responses
--no-threads Do not use multiple background threads to upload files
--max-threads N Specify a different max # of N upload threads
(defaults to 4)
--api-key API_KEY Manually provide a One Codex Beta API key
--version show program's version number and exit

Upload the sequences to the platform:

onecodex upload Sample2a.fastq.gz

This took me about five minutes – if you are using a small file like the 100-day infant gut metagenome it will be almost instantaneous. If we now give the following command:

onecodex analyses

it will show something similar to the following:

"analysis_status": "Pending",
"id": "6845bd3fa31c4c09",
"reference_id": "f5a3d51131104d7a",
"reference_name": "RefSeq 65 Complete Genomes",
"sample_filename": "Sample2a.fastq.gz",
"sample_id": "d4aff2bdf0db47cd"
"analysis_status": "Pending",
"id": "974c3ef01d254265",
"reference_id": "9a61796162d64790",
"reference_name": "One Codex 28K Database",
"sample_filename": "Sample2a.fastq.gz",
"sample_id": "d4aff2bdf0db47cd"

where the “analysis_status” of “Pending” indicates that the sample is still being processed. There are two entries because the sequences are being matched against two databases: the RefSeq 65 Complete Genomes and the One Codex 28K Database. According to the web site, “The RefSeq 65 Complete Genomes database […] includes 2718 bacterial genomes and 2318 viral genomes” and the “expanded One Codex 28k database includes the RefSeq 65 database as well as 22,710 additional genomes from the NCBI repository, for a total of 23,498 bacterial genomes, 3,995 viral genomes and 364 fungal genomes.”

After waiting for 10-15 minutes or so (due to some very recently added parallelization capabilities it should only take 4-5 minutes now), the “analysis_status” started showing “Success”. Now we can look at the results. Let’s check out the One Codex 28K database results. You just need to call onecodex analyses with the “id” value shown in one of the outputs above.

bmp:OneCodex mikaelhuss1$ onecodex analyses 974c3ef01d254265
"analysis_status": "Success",
"id": "974c3ef01d254265",
"n_reads": 2069638,
"p_mapped": 0.21960000000000002,
"reference_id": "9a61796162d64790",
"reference_name": "One Codex 28K Database",
"sample_filename": "Sample2a.fastq.gz",
"sample_id": "d4aff2bdf0db47cd",
"url": ""

So One Codex managed to assign a likely source organism to about 22% of the sequences. There is a URL to the results page. This URL is by default private to the user who created the analysis, but One Codex has recently added functionality to make results pages public if you want to share them, so I did that: Artisanal Cheese Metagenome Classification. Feel free to click around and explore the taxonomic tree and the other features.

You can also retrieve your analysis results as a JSON file:

onecodex analyses 974c3ef01d254265 --table > cheese.json

We see that the most abundantly detected bacterium in this artisanal cheese sample was Streptococcus macedonicus, which makes sense as that is a dairy isolate frequently found in fermented dairy products such as cheese.

Compressive sensing and bioinformatics

Compressive (or compressed) sensing is (as far as I can tell; I have learned everything I know about it from the excellent Nuit Blanche blog) a relatively newly named research subject somewhere around the intersection of signal processing, linear algebra and statistical learning. Roughly speaking, it deals with problems like how to reconstruct a signal from the smallest possible number of measurements and how to complete a matrix (to fill in missing matrix elements – if you think that sounds kind of abstract, this¬† presentation on collaborative filtering by Erik Bernhardssons nicely explains how Spotify uses matrix completion for song recommendations; NetFlix and others also do similar things). The famous mathematician Terence Tao, who has done a lot of work in developing compressed sensing, has much better explanations than I can give, for instance this 2007 blog post that exemplifies the theory with the “single-pixel camera” and this PDF presentation where he explains how CS relates to standard linear (Ax = b) equation systems that are under-determined (more variables than measurements). Igor Carron of the Nuit Blanche blog also has a “living document” with a lot of explanations about what CS is about (probably more than you will want to refer to in a single sitting).

Essentially, compressed sensing works well when the signal is sparse in some basis (this statement will sound much clearer if you have read one of the explanations I linked above). The focus on under-determined equation systems made me think about whether this could be useful for bioinformatics, where we frequently encounter the case where we have few measurements of a lot of variables (think of, e.g., the difficulty of obtaining patient samples, and the large number of gene expression measurements that are taken from them). The question is, though, whether gene expression vectors (for example) can be thought of as sparse in some sense. Another train of thought I had is that it would be good to develop even more approximate and, yes, compressive methods for things like metagenomic sequencing, where the sheer amount of information pretty quickly starts to break the available software. (C Titus Brown is one researcher who is developing software tools to this end.)

Of course, I was far from being the first one to make the connection between compressive sensing and bioinformatics. Olgica Milenkovic had thoughtfully provided a presentation on sparse problems in bioinformatics (problems that thus could be addressable with CS techniques).

Apart from the applications outlined in the above presentation, I was excited to see a new paper about a CS approach to metagenomics:

Quikr – A method for rapid reconstruction of bacterial communities using compressed sensing

Also there are a couple of interesting earlier publications:

A computational model for compressed sensing RNAi cellular screening

Compressive sensing DNA microarrays

Large-scale machine learning course & streaming organism classification

The NYU Large Scale Machine Learning course looks like it will be very worthwhile to follow. The instructors, John Langford and Yann Le Cun, are both key figures in the machine learning field – for instance, the former developed Vowpal Wabbit and the latter has done pioneering work in deep learning. It is not an online course like those at Coursera et al., but they have promised to put lecture videos and slides online. I’ll certainly try to follow along as best I can.


There is an interesting Innocentive challenge going on, called “Identify Organisms from A Stream of DNA Sequences.” This is interesting to me both because of the subject matter (classification based on DNA sequences) and also because the winner is explicitly required to submit an efficient, scalable solution (not just a good classifier.) Also, the prize sum is one million US dollars! It’s exactly this kind of algorithms that will be needed to enable the “genomic observatories” that I have mentioned before on this blog which will continuously stream out sequences obtained from the environment.

Post Navigation