This is the second in a series of about three posts with notes on different genomics APIs. The first post, which was about the One Codex API, can be found here.
As you may have heard, Google has started building an ambitious infrastructure for storing and querying genomic data, so I was eager to start exploring it. However, as there were a number of tools available, I initially had some trouble wrapping my head around what I was supposed to do. I hope these notes, where I mainly use the API for R, can provide some help.
Some useful bookmarks:
Google Developers Console – for creating and managing Google Genomics and BigQuery projects.
Google Genomics GitHub repo
Google Cloud Platform Google Genomics page (not sure what to call this page really)
Getting started
You should start by going to the Developer Console and creating a project. You will need to give it a name, and in addition it will be given a unique ID which you can use later in API calls. When the project has been created, click “Enable an API” on the Dashboard page, and click the button where it says “OFF” next to Genomics API (you may need to scroll down to find it).
Now you need to create a client_secret.json
file that you will use for some API calls. Click the Credentials link in the left side panel and then click “Create new client ID”. Select “Installed application” and fill in the “Consent screen” form. All you really need to do is select an email address and type a “product name”, like “BlogTutorial” like I did for this particular example. Select “Installed application” again if you are prompted to select an application type. Now it should display some information under the heading “Client ID for native application”. Click the “Download JSON” button and rename the file to client_secret.json. (I got these instructions from here.)
Using the Java API client for exploring the data sets
One of the first questions I had was how to find out which datasets are actually available for querying. Although it is perfectly possible to click around in the Developer Console, I think the most straightforward way currently is to use the Java API client. I installed it from the Google Genomics GitHub repo by cloning:
git clone git@github.com:googlegenomics/api-client-java.git
The GitHub repo page contains installation instructions, but I will repeat them here. You need to compile it using Maven:
cd api-client-java
mvn package
If everything goes well, you should now be able to use the Java API client to look for datasets. It is convenient (but not necessary) to put the client_secret.json file into the same directory as the Java API client. Let’s check which data sets are available (this will only work for projects where billing has been enabled; you can sign up for a free trial in which case you will not be surprise-billed):
java -jar genomics-tools-client-java-v1beta2.jar listdatasets --project_number 761052378059 --client_secrets_filename client_secret.json
(If your client_secret.json file is in another directory, you need to give the full path to the file, of course.) The project number is shown on your project page in the Developer Console. Now, the client will open a browser window where you need to authenticate. You will only need to do this the first time. Finally, the results are displayed. They currently look like this:
Platinum Genomes (ID: 3049512673186936334)
1000 Genomes - Phase 3 (ID: 4252737135923902652)
1000 Genomes (ID: 10473108253681171589)
So there are three data sets. Now let’s check which reference genomes are available:
java -jar genomics-tools-client-java-v1beta2.jar searchreferencesets --client_secrets_filename ../client_secret.json --fields 'referenceSets(id,assemblyId)'
The output is currently:
{"assemblyId":"GRCh37lite","id":"EJjur6DxjIa6KQ"}
{"assemblyId":"GRCh38","id":"EMud_c37lKPXTQ"}
{"assemblyId":"hs37d5","id":"EOSt9JOVhp3jkwE"}
{"assemblyId":"GRCh37","id":"EOSsjdnTicvzwAE"}
{"assemblyId":"hg19","id":"EMWV_ZfLxrDY-wE"}
To find out the names of the chromosomes/contigs in one of the reference genomes: (by default this will only return the ten first hits, so I specify –count 50)
java -jar genomics-tools-client-java-v1beta2.jar searchreferences --client_secrets_filename client_secret.json --fields 'references(id,name)' --reference_set_id EMWV_ZfLxrDY-wE --count 50
Now we can try to extract a snippet of sequence from one of the chromosomes. Chromosome 9 in hg19 had the ID EIeX4KDCl634Jw, so the query becomes, if we want to extract some sequence from 13 Mbases into the chromosome:
java -jar genomics-tools-client-java-v1beta2.jar getreferencebases --client_secrets_filename client_secret.json --reference_id ENywqdu-wbqQBA --start 13000000 --end 13000070
This returns the sequence AGGGACAGGAATTGAGATTTAGGAAGCCATCAGGACTTGGGTTTTTGTCATCCCACTCTATTTCTCTCTG
.
Another thing you might want to do is to check which “read groups” that are available in one of the data sets. For instance, for the Platinum Genomes data set we get:
java -jar genomics-tools-client-java-v1beta2.jar searchreadgroupsets --dataset_id 3049512673186936334 --client_secrets_filename client_secret.json
which outputs a bunch of JSON records that show the corresponding sample name, BAM file, internal IDs, software and version used for alignment to the reference genome, etc.
Using BigQuery to search Google Genomics data sets
Now let’s see how we can call the API from R. The three data sets mentioned above can be queried using Google’s BigQuery interface, which allows SQL-like queries to be run on very large data sets. Start R and install and load some packages:
install.packages("devtools") # unless you already have it!
library("devtools")
devtools::install_github("hadley/assertthat")
devtools::install_github("hadley/bigrquery")
library("bigrquery")
Now we can access BigQuery through R. Try one of the non-genomics data sets just to get warmed up.
project <- '(YOUR_PROJECT_ID)' # the ID of the project from the Developer Console
sql <- 'SELECT title,contributor_username,comment FROM[publicdata:samples.wikipedia] WHERE title contains "beer" LIMIT 100;'
data <- query_exec(sql, project)
Now the data object should contain a list of Wikipedia articles about beer. If that worked, move on to some genomic queries. In this case, I decided I wanted to look at the SNP for the photic sneeze reflex (the reflex that makes people such as myself sneeze when they go out on a sunny day) that 23andme discovered via their user base. That genetic variant has the ID and is located on chromosome 2, base 146125523 in the hg19 reference genome. It seems that 23andme uses a 1-based coordinate system (the first nucleotide has the index 1) while Google Genomics uses a 0-based system, so we should look for base position 146125522 instead. We query the Platinum Genomes variant table: (you can find the available tables at the BigQuery Browser Tool Page)
sql <- 'SELECT reference_bases,alternate_bases FROM[genomics-public-data:platinum_genomes.variants] WHERE reference_name="chr2" AND start=146125522 GROUP BY reference_bases,alternate_bases;'
query_exec(sql, project)
This shows the following output:
reference_bases alternate_bases
1 C T
This seems to match the description provided by 23andme; the reference allele is C and the most common alternate allele is T. People with CC have slightly higher odds of sneezing in the sun, TT people have slightly lower odds, and people with CT have average odds.
If we query for the variant frequencies (VF) in the 13 Platinum genomes, we get the following results (the fraction represents, as I interpret it, the fraction of sequencing reads that has the “alternate allele”, in this case T):
sql <- 'SELECT call.call_set_name,call.VF FROM[genomics-public-data:platinum_genomes.variants] WHERE reference_name="chr2" AND start=146125522;'
query_exec(sql, project)
The output is as follows:
call_call_set_name call_VF
1 NA12882 0.500
2 NA12877 0.485
3 NA12889 0.356
4 NA12885 1.000
5 NA12883 0.582
6 NA12879 0.434
7 NA12891 1.000
8 NA12888 0.475
9 NA12886 0.434
10 NA12884 0.459
11 NA12893 0.588
12 NA12878 0.444
13 NA12892 0.533
So most people here seem to have a mix of C and T, with two individuals (NA12891 and NA12885) having all T:s, in other words they appear to be homozygous for the T allele, if I am interpreting this correctly.
Using the R API client
Now let’s try to use the R API client. In R, install the client from GitHub, and also ggbio and ggplot2 if you don’t have them already:
source("http://bioconductor.org/biocLite.R")
biocLite("ggbio")
devtools::install_github("googlegenomics/api-client-r")
install.packages("ggplot2")
library("GoogleGenomics")
library("ggbio")
library("ggplot2")
First we need to authenticate for this R session:
authenticate(file="/path/to/client_secret.json") # substitute the actual path to your client_secret.json file
The Google Genomics GitHub repo page has some examples on how to use the R API. Let’s follow the Plotting Alignments example.
reads <- getReads(readGroupSetId="CMvnhpKTFhDyy__v0qfPpkw",
chromosome="chr13",
start=33628130,
end=33628145)
This will fetch reads corresponding to the given genomic interval (which turns out to overlap a gene called KL) in the read group set called CMvnhpKTFhDyy__v0qfPpkw. By applying one of the Java API calls shown above and grepping for this string, I found out that this corresponds to a BAM file for a Platinum Genomes sample called NA12893.
We need to turn thereads
list into a GAlignment object:
alignments <- readsToGAlignments(reads)
Now we can plot the read coverage over the region using some ggbio functions.
alignmentPlot <- autoplot(alignments, aes(color=strand,fill=strand))
coveragePlot <- ggplot(as(alignments, 'GRanges')) + stat_coverage(color="gray40", fill="skyblue")
tracks(alignmentPlot, coveragePlot, xlab="Reads overlapping for NA12893")
As in the tutorial, why not also visualize the part of the chromosome where we are looking.
ideogramPlot <- plotIdeogram(genome="hg19", subchr="chr13")
ideogramPlot + xlim(as(alignments, 'GRanges'))
Now you could proceed with one of the other examples, for instance the variant annotation comparison example, which I think is a little bit too elaborate to reproduce here.