Notes on genomics APIs #3: SolveBio
This is the third in a short series of posts with notes on different genomics APIs. The first post, which was about the One Codex API, can be found here, and the second one, about Google Genomics, can be found here.
SolveBio “delivers the critical reference data used by hospitals and companies to run genomic applications”, according to their web page. They focus on clinical genomics and on helping developers who need to access various data sources in a programmatic way. Their curated data library provides access to (as of February 2015) “over 300 datasets for genomics, proteomics, literature annotation, variant-disease relationships, and more.” Some examples of those datasets are the ClinVar disease gene database from NIH, the Somatic Mutations dataset from The Cancer Genome Atlas, and the COSMIC catalogue of somatic mutations in cancer.
SolveBio offers a RESTful API with Python and Ruby clients already available and an R client under development. The Getting Started Guide really tells you most of what you need to know to use it, but let’s try it out here on this blog anyway!
You should, of course, start by signing up for a free account. After that, it’s time to get the client. I will use the Python one in this post. It can be installed by giving this command:
curl -skL install.solvebio.com/python | bash
You can also install it with pip.
Now you will need to log in:
solvebio login
This will prompt you for the email address and password that you registered when signing up.
At this point you can view a useful tutorial by running
solvebio tutorial
The tutorial explains the concept of depositories, which are versioned containers for data sets. For instance (as explained in the docs), there is a ClinVar depository which (as of version 3.1.0) has three datasets: ClinVar, Variants, and Submissions. Each dataset within a depository is designed for a specific use case. For example, the Variants dataset contains data on genomic variants, and supports multiple genome builds.
Now start the interactive SolveBio shell:
solvebio shell
This shell (in case you followed the instructions above) is based on IPython.
You can view all the data sets with Dataset.all(), or use Dataset.all(latest=True) to view only the latest additions.
To work with a data set, you need to ‘retrieve’ it with a command like:
ds = Dataset.retrieve('ClinVar/3.1.0-2015-01-13/Variants')
It is perfectly possible to leave out the version of the data set:
ds = Dataset.retrieve('ClinVar/Variants')
but that is bad practice from a reproducibility viewpoint and is not recommended, especially in production code.
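One way to guard against accidentally unpinned data sets is to check the path before handing it to Dataset.retrieve. The following is only a sketch in plain Python: the names DatasetPath, parse_dataset_path and require_pinned are made up for this post and are not part of the SolveBio client, and it assumes that paths always look like either 'Depository/Version/Dataset' or 'Depository/Dataset', as in the two examples above.

```python
from typing import NamedTuple, Optional


class DatasetPath(NamedTuple):
    """Hypothetical helper type, not part of the SolveBio client."""
    depository: str
    version: Optional[str]  # None when the path leaves the version out
    dataset: str


def parse_dataset_path(path: str) -> DatasetPath:
    """Split 'Depository/Version/Dataset' or 'Depository/Dataset' into parts."""
    parts = path.split("/")
    if len(parts) == 3:
        return DatasetPath(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        return DatasetPath(parts[0], None, parts[1])
    raise ValueError(f"unexpected dataset path: {path!r}")


def require_pinned(path: str) -> str:
    """Return the path unchanged, or raise if it has no explicit version."""
    if parse_dataset_path(path).version is None:
        raise ValueError(f"dataset path {path!r} does not pin a version")
    return path
```

In the SolveBio shell you could then write Dataset.retrieve(require_pinned('ClinVar/3.1.0-2015-01-13/Variants')), and the unversioned form would fail early instead of silently picking up whatever version is current.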
Now we can check which fields are available in the ds object representing the data set we selected:
ds.fields()
There are fields for things like alternate alleles for the variant in question, sources of clinical information on the variant, the name of any gene(s) overlapping the variant, and the genomic coordinates for the variant.
You can create a Python iterator for looping through all the records (variants) using
ds.query()
To view the first variant, type
ds.query()[0]
This will give you an idea of how each record (variant) is described in this particular data set. In practice, you will almost always want to filter your query according to some specified criteria. So for example, to look for known pathogenic variants in the titin (TTN) gene, you could filter as follows:
ttn_vars = ds.query().filter(clinical_significance='Pathogenic', gene_symbol_hgnc='TTN')
This will give you an iterator with a bunch of records (currently 18) that you can examine in more detail.
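As one way of examining such a result in more detail, here is a small sketch that tallies the values of one field across an iterable of records. It assumes that each record behaves like a Python dictionary keyed by field name; the function name tally and the two stand-in records below are made up for illustration and are not real ClinVar data.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def tally(records: Iterable[Mapping[str, Any]], field: str) -> Counter:
    """Count how often each value of `field` occurs across the records."""
    counts = Counter()
    for record in records:
        value = record.get(field)
        # Some fields may hold a list of values; count each one separately.
        if isinstance(value, (list, tuple)):
            counts.update(value)
        else:
            counts[value] += 1
    return counts


# Made-up stand-in records, only to show how the function is called.
example = [
    {"gene_symbol_hgnc": "TTN", "clinical_significance": "Pathogenic"},
    {"gene_symbol_hgnc": "TTN", "clinical_significance": "Pathogenic"},
]
print(tally(example, "clinical_significance"))
```

If the assumption about dictionary-like records holds, something like tally(ttn_vars, 'clinical_significance') should work on the query result in the same way.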
If you want to search for variants in some specified genomic region that you have identified as interesting, you can do that too, but it is only possible for some data sets. In this case it turns out that we can do it with this version of the ClinVar variant data set, because it is considered a “genomic” data set, which we can see because
ds.is_genomic
returns True. (Some of the older versions return False here.) A region query then looks like this:
ds.query(genome_build='GRCh37').range('chr3', 22500000, 23000000)
Note that you can specify a genome build in the query, which is very convenient.
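Conceptually, a region query of this kind keeps the records whose coordinates overlap the requested interval on the same chromosome. The sketch below shows that overlap test in plain Python. It is not SolveBio's implementation: the Region type, the overlaps function and the three variant positions are made up for illustration, and treating both endpoints as inclusive is an assumption.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical helper type: a closed interval on one chromosome."""
    chromosome: str
    start: int
    stop: int


def overlaps(a: Region, b: Region) -> bool:
    """True when both regions are on the same chromosome and share a position."""
    return a.chromosome == b.chromosome and a.start <= b.stop and b.start <= a.stop


# The region from the query above, plus three made-up single-base variants.
query = Region("chr3", 22500000, 23000000)
variants = [
    Region("chr3", 22600000, 22600000),  # inside the region
    Region("chr3", 23000001, 23000001),  # one base past the end
    Region("chr1", 22600000, 22600000),  # same position, other chromosome
]
hits = [v for v in variants if overlaps(v, query)]
```

Only the first of the three made-up variants ends up in hits.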
Moving on to a different depository and data set, we can search for diabetes-related variants identified in genome-wide association studies (GWAS) with something like the following:
ds = Dataset.retrieve('GWAS/1.0.0-2015-01-13/GWAS')
ds.fields() # Check out which fields are available
ds.query().filter(phenotype='diabetes') # Also works with "Diabetes"
ds.query().filter(journal='science', phenotype='diabetes') # Only look for diabetes GWAS published in Science
Also, giving a command like
Dataset.retrieve('GWAS/1.0.0-2015-01-13/GWAS').help() will open up a web page describing the dataset in your browser.