Follow the Data

A data driven blog

Archive for the category “Tools and Software”

Follow the Data podcast, episode 3: Grokking Big Data with Paco Nathan

In this third episode of the Follow the Data podcast we talk to Paco Nathan, Data Scientist at Concurrent Inc.

Podcast link: http://s3.amazonaws.com/follow_the_data/FollowTheData_03_Podcast.mp3

Paco’s blog: http://ceteri.blogspot.se/

The running time is about one hour.

Paco’s internet connection died just as we were about to start the podcast, so he had to connect via Skype on his iPhone. We apologize on behalf of his internet provider in Silicon Valley for the reduced sound quality.

Here are a few links to stuff we discussed:

http://www.cascading.org/
An application framework for Java developers to quickly and easily develop robust Data Analytics and Data Management applications on Apache Hadoop.

http://clojure.org/
A dialect of Lisp that runs on the JVM.

https://github.com/twitter/scalding
A Scala library that makes it easy to write MapReduce jobs in Hadoop.

http://www.cascading.org/multitool/
A simple command line interface for building large-scale data processing jobs based on Cascading.

http://en.wikipedia.org/wiki/CAP_theorem
states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency, Availability, Partition tolerance

http://www.nature.com/news/nanopore-genome-sequencer-makes-its-debut-1.10051
an article on the USB-sized Oxford Nanopore MinION sequencer

http://datakind.org/
Previously known as Data Without Borders this organisation aims to do good with Big Data.

http://www.climate.com/
Prediction based insurance for farmers.

http://en.wikipedia.org/wiki/All_Watched_Over_by_Machines_of_Loving_Grace_(TV_series)
An interesting take on how programming culture has affected life. Link to episode #2 (http://vimeo.com/29875053)  “The use and abuse of vegetational concepts” – about how the idea of ecosystems came to be, sprung out of the notion of harmony in nature, how this influenced cybernetics and the perils of taking this animistic concept too far.

http://scratch.mit.edu/
A great way to teach kids to code.

http://www.stencyl.com/
Another interesting tool for teaching kids to code and build games.

http://www.minecraft.net/
Free form virtual reality game.

http://www.yelloworb.com/orbblog/
Some info on arduino-based wireless wind measurement project by Karl-Petter Åkesson (in Swedish).

http://www.fringeware.com/
A pioneering internet retailer of which Paco was one of the founders.

What can “big data” (read “Hadoop”) do for genomics?

Prompted by the recent news that Cloudera and Mount Sinai School of Medicine will collaborate to “solve medical challenges using big data” (more specifically, Cloudera’s Jeff Hammerbacher, ex-big data guru at Facebook, will collaborate with the equally trailblazing mathematician/biologist Eric Schadt at Mount Sinai’s Institute for Genomics and Multiscale Biology) and that NextBio will collaborate with Intel to “optimize the Hadoop stack and advance big data technologies in medicine”, I would like to offer some random thoughts on possible use cases.

Note that “big data” essentially means “Hadoop” in the above press releases, and that the “medicine” they mention should be understood as “genomic medicine” or just “genomics”. Since I happen to know a thing or two about genomics, I will limit myself to (parts of) genomics and Hadoop/MapReduce in this post. For a good overview of big data and medicine in a broader sense than I can describe here, check out this rather nice GigaOm article.

Existing Hadoop/MapReduce stuff for NGS

In the world of high-throughput, or next-generation, sequencing (NGS), which is rapidly becoming more and more indispensable for genomics, there are a few Hadoop-based frameworks that I am aware of and that should probably be mentioned first. Packages like Cloudburst and Crossbow leverage Hadoop to perform “read mapping” (approximate string matching for taking a DNA sequence from the sequencer and figuring out where in a known genome it came from), Myrna and Eoulsan do the same but also extend the workflow to quantifying gene expression and identifying differentially expressed genes based on the sequences, and Contrail does Hadoop-based de novo assembly (piecing together a new genome from sequences without previous knowledge, like an extremely difficult jigsaw puzzle). These are essentially MapReduce implementations of existing software, which is all well and good, but I haven’t seen these tools being used much so far. Perhaps one reason is that read mapping is usually not a major bottleneck compared to some other steps, and with recently released software such as SeqAlto and SNAP (thx Tom Dyar) (and another package that I’m sure I read about the other day but can’t seem to find right now) promising a further 10x-100x speed increase compared to existing tools, there is just not a pressing need at the moment. Contrail, the de novo assembler, does offer an opportunity for research groups who don’t have access to very RAM-rich computers (de novo assembly is notoriously memory hungry, with 512 GB RAM machines often being strained to the limit on certain data sets) to perform assembly on commodity clusters.
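To make the “read mapping as MapReduce” idea a bit more concrete, here is a toy sketch in Python. It is not how Cloudburst or Crossbow are implemented; the seed length, the mismatch threshold and all function names are assumptions made up for this illustration. The map step finds candidate positions for each read through shared k-mer seeds and checks them by Hamming distance, and the reduce step keeps the best hit per read.

```python
from collections import defaultdict

K = 4  # seed length; a toy value chosen for this example


def build_seed_index(reference, k=K):
    """Index every k-mer of the reference by its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read_id, read, reference, index, max_mismatches=1, k=K):
    """Map step: yield (read_id, (start, mismatches)) for each acceptable hit."""
    seen = set()
    for offset in range(len(read) - k + 1):
        for seed_pos in index.get(read[offset:offset + k], []):
            start = seed_pos - offset
            if start < 0 or start + len(read) > len(reference) or start in seen:
                continue
            seen.add(start)
            mismatches = hamming(read, reference[start:start + len(read)])
            if mismatches <= max_mismatches:
                yield read_id, (start, mismatches)


def reduce_hits(pairs):
    """Reduce step: keep the hit with the fewest mismatches for each read."""
    best = {}
    for read_id, (start, mismatches) in pairs:
        if read_id not in best or mismatches < best[read_id][1]:
            best[read_id] = (start, mismatches)
    return best


def map_reads(reads, reference):
    """Run the map and reduce steps in a single process."""
    index = build_seed_index(reference)
    pairs = []
    for read_id, read in reads.items():
        pairs.extend(map_read(read_id, read, reference, index))
    return reduce_hits(pairs)
```

On a real cluster the map step would run in parallel over chunks of reads and the framework would do the grouping by read ID; here everything runs in one process just to show the shape of the computation.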

Then there are the projects that attempt to build a Hadoop infrastructure for next-generation sequencing, like Seal, which provides “map-reducification” for a number of common NGS operations, or Hadoop-BAM (a library for processing BAM files, a common sequence alignment format, in Hadoop) and SeqPig (a library with import and export functions to allow common bioinformatics formats to be used in Pig).

What Hadoop could be useful for

I’m sure people smarter than me will come up with many different use cases for Hadoop in genomics and medicine. At this point, however, I would suggest these general themes:

  • Statistical associations between various kinds of data vectors – clinical, environmental, molecular, microbial... This is more or less a batch-processing problem and thus suited to Hadoop. NextBio (the company mentioned in the beginning, who are teaming up with Intel) are doing this as a core part of their business: computing correlations between gene expression levels in different tissues, diseases and conditions and clinical information, drug data etc. However, this concept could (and should) be extended to other things like environmental information, lifestyle factors, genetic variants (SNV, structural variations, copy number variations etc.), epigenetic data (chromatin structure, DNA methylation, histone modifications …) and personal microbiomes (the gut microbiota in each patient etc.). Of course, collecting and compiling the data to perform these correlations will be hard; a much harder “big data” problem than computing the actual correlations. SolveBio is a new company that seems to want to understand cancer by compiling vast quantities of data in such a way. This is how they put it in an interview (titled, ambitiously, “The Cloud Will Cure Cancer”): “Patients can measure every feature, as the technology becomes cheaper: genome sequence, gene expression in every accessible tissue, chromatin state, small molecules and metabolites, indigenous microbes, pathogens, etc. These data pools can be created by anyone who has the consent of the patients: universities, hospitals, or companies. The resulting networks, the “data tornado”, will be huge. This will be a huge amount of data and a huge opportunity to use statistical learning for medicine.” In fact, a third recently announced big data/genomics collaboration, between Google and the Institute for Systems Biology (ISB), has already started to explore what this type of tool could look like in their Cancer Regulome Explorer. ISB has used the Google Compute Engine to scale a random forest algorithm to 600,000 cores across Google’s global data centers in order to “explore associations between DNA, RNA, epigenetic, and clinical cancer data.” See this case study for some more details (not many more, to be honest).
  • Metagenomics. This means, according to one definition, “the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species.” (There is really nothing “meta” about it, it’s just that you are looking at many species at once, which is why it is also called environmental genomics or community genomics in some cases.) For example, Craig Venter’s project to sequence as many living things as possible in the Sargasso Sea is metagenomics, as is sequencing samples from the human gut, snot etc. in search of novel bacteria, viruses and fungi (or just characterizing the variety of known ones). It’s a fascinating field; for an easy introduction, see the TED Talk called “What’s left to explore?” by Nathan Wolfe. Analyzing sequences from metagenomics projects is of course much more difficult than usual, because you are randomly sampling sequences for which you don’t know the source organism but have to infer it in some way. This calls for smart use of proper data structures for indexing and querying, and as much parallelization as possible, very likely in some Hadoopy kind of way. C. Titus Brown has written a lot of interesting stuff about the metagenomics data deluge on his blog, Living in an Ivory Basement, where he has explored esoteric and useful things such as probabilistic de Bruijn graphs. Lately, compressive genomics (algorithms that compute directly on compressed genomic data) has become something of a buzz phrase, although similar ideas have been used for quite some time. Some combination of all of these approaches will be needed to combat the inevitable information overload.
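
As a small illustration of the “proper data structures” point, here is a hedged sketch of the general idea behind a probabilistic de Bruijn graph: store the k-mers of the reads in a Bloom filter, and treat the graph edges as implicit (a k-mer is connected to whichever of its four possible successors the filter reports as present). The class and function names, the filter size and the choice of SHA-256 as hash function are assumptions for this example, not taken from any particular tool.

```python
import hashlib


class KmerBloomFilter:
    """A fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, kmer):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer)
        )


def kmers(sequence, k):
    """All overlapping substrings of length k."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


def successors(bloom, kmer):
    """Successor k-mers the filter reports as present (edges are implicit)."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]
```

The memory footprint is fixed by `num_bits` no matter how many k-mers are added; the price is that some k-mers that were never seen will be reported as present, which shows up as spurious edges in the graph.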

Beyond batch processing

In my mind, Hadoop has been associated with batch processing, but today I heard that the newest version of Hadoop not only includes a completely overhauled version of MapReduce called YARN, but it will even allow using other kinds of frameworks, such as streaming real-time analytics frameworks, to operate on the data stored in HDFS. I’ve been thinking about possible applications of stream analytics in next-generation sequencing. Surprisingly, there is already software for streaming quantification of sequences, eXpress - these guys are surely ahead of their time. The immediate use case I can think of is for the USB-stick-sized MinION nanopore sequencer, which reportedly will produce output in a real-time manner (which no sequencers do today as far as I know) so that you can start your analysis while the sequencer is still running. If the vision about “genomic observatories” to “take the planet’s biological pulse” comes true, I’m sure there will be plenty of work to do for the stream analytics clusters of the world …

This has been a rambling post that will probably need a few updates in the coming days – congratulations and thanks if you made it to the end!

MLDemos visualizes what classifiers do

MLDemos is based on a really nice idea – to visualize how different classifiers construct the decision boundaries around arbitrary sets of data points. I had of course seen the concept of decision boundaries before; in many machine-learning classes you will draw or at least get to see boundaries or surfaces that delineate the parts of the sample space where a classifier will yield different predictions. In MLDemos, you get to draw the points in the (2-D) sample space by hand, and you can choose between a variety of different algorithms. Or if you want, you can upload your own data sets. The software doesn’t just do decision boundaries, it also visualizes regression, clustering and dynamical systems in cool and downright beautiful ways.
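
For readers who want to see the idea without installing anything, here is a minimal text-only sketch (not MLDemos code). It assumes a k-nearest-neighbour classifier, points in the unit square given as (x, y, label) tuples, and single-character labels so that the predicted class can be printed directly as a character map; all names and defaults are made up for this example.

```python
def knn_predict(points, query, k=3):
    """Majority label among the k training points closest to `query`."""
    by_distance = sorted(
        points,
        key=lambda p: (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2,
    )
    votes = {}
    for _, _, label in by_distance[:k]:
        votes[label] = votes.get(label, 0) + 1
    # Ties are broken alphabetically so the output is deterministic.
    return max(sorted(votes), key=lambda label: votes[label])


def decision_map(points, width=20, height=10, k=3):
    """Classify every cell of a grid over the unit square; one string per row."""
    rows = []
    for row in range(height):
        y = 1.0 - row / (height - 1)  # the top row corresponds to y = 1
        line = ""
        for col in range(width):
            x = col / (width - 1)
            line += knn_predict(points, (x, y), k)
        rows.append(line)
    return rows
```

Printing the result with `print("\n".join(decision_map(points)))` shows the regions each class claims; the border between the runs of different characters is the decision boundary.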

Google Prediction API open to all

I’ve been eagerly waiting to use the Google Prediction API ever since it was announced, and now (since sometime in May) it’s open for everyone who has a Google account (and a credit card). Previously, you had to be able to provide a U.S. mailing address.

Google’s Prediction API is basically a nice way to run your classification and/or prediction tasks through Google’s black-box set of machine learning tools. The way it works is that you upload your training data to Google Storage, which is something like Google’s version of Amazon’s S3: a cloud-based storage system where you store your data in “buckets”. (Google Storage, like S3, uses the term bucket and, also like S3, requires that bucket names only use lower-case letters.) You can activate both Google Storage and the Prediction API from the Google APIs Console. This is also where you will find (click “API access” on the left hand menu) the access key that you will need to run prediction tasks. You’ll have to give credit card details to pay for potential future usage.

The training examples that you put in Storage need to be formatted according to the specification in the Developer’s Guide. Once they have been uploaded, you can train a model on the uploaded data, make predictions about new examples, update existing models and more using one of the client libraries or even simpler, just by copying some of the bash scripts shown on the same page (hidden behind ‘+’ signs which can be expanded.) For these bash scripts to work as written on that page, you need to paste your API key into a file called ‘googlekey’ located in the directory from where you are running the script.

I used this walkthrough example about cancer classification from gene expression data to get up to speed on how Google Prediction API works. Now I’m thinking about what data to throw at it next. Perhaps it would be fun to input some Kaggle contest data sets into it as a kind of “Google baseline” predictor? :-)

RStudio

I’m not normally a big user of IDEs, but I have to say that the new RStudio is pretty slick. It’s a free, open-source IDE for R and looks a bit like the Matlab IDE with a tabbed interface for convenient access to variables and objects, plots and data tables. RStudio runs on Mac, Linux and Windows or on a server, where it can be accessed remotely through a web browser. A nice touch is that it supports Sweave and TeX document creation, although I haven’t tested either of those yet. Maybe now’s the time to learn some Sweave. I started to use RStudio yesterday and I think it will replace the Mac GUI for R that I have been using. The latter is all right but a bit too disjointed when you start plotting and editing several files at once.

The Neural Phone, Darwin phones and Ali Baba’s data treasure

Just the other day, I listened to a podcast about mobile sensing, where professor Andrew Campbell from the mobile sensing group at Dartmouth College talked about a lot of interesting stuff. In particular, I was captivated by his description of the Neural Phone. Apparently, the idea for it was born when Campbell was out jogging and wanted to be able to phone his wife (or a friend, I don’t remember) without touching the phone. Eventually, he and his group managed to put together an iPhone with a cheap EEG headset so that a particular type of “brain wave”, a so-called P300 potential, could be detected by the headset and used to control the phone. In an app called “Dial Tim”, they demonstrated that you could dial an iPhone contact by thinking (producing a P300 potential) when that person was shown on the phone. A lightweight classifier is used to detect the signal corresponding to the desire to call a certain person. It should be noted that according to this interesting paper, the classifier is still pretty sensitive to the person’s state (sitting, standing etc.)

This opens up possibilities for really wild “mind reading” applications; the authors mention the possibility of sensing the aggregate mood in a room from these kinds of neural signals, and also how a foreign language teacher could get real-time statistics on the students’ comprehension from EEG data and thus always know how many of them actually understood a question.

Of course, there is a sinister aspect to this, namely that stray neural signals “in the wild” could be detected by malicious others. The authors call this scenario, which arises from the way the neural information is transmitted in unencrypted IP packets between iPhones, “neural packets everywhere.”

Intrigued by this, I looked up some other work that Campbell and his group have done on mobile phones and classification. This paper deals with “Darwin phones”, a framework for “collaborative sensing” and classification using mobile phones. The authors state that to the best of their knowledge, “Darwin is the first system that applies distributed machine learning techniques and collaborative inference concepts to mobile phones.” The paper contains a number of cool ideas, like “model pooling”, where phones that are close to each other can “borrow” trained classification models from each other, and “collaborative inference”, where a group of phones combine the predictions from their respective models into a potentially more robust overall prediction, which is less sensitive to particularities such as background noise specific to each phone’s location. This way of boosting predictions by using different models to reduce noise is reminiscent of how ensemble models are used in machine learning. The concepts of model pooling and collaborative inference are very useful, because it is typically quite time-consuming to train a classification model on a mobile phone; people don’t like to be bothered to provide labels for training examples.
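
To give a feel for collaborative inference, here is a minimal sketch. I don’t describe the paper’s actual combination rule above, so this simply assumes that each phone reports a probability per class and that the group averages them, as in a basic ensemble; the function name and the class labels are hypothetical.

```python
def collaborative_inference(per_phone_probs):
    """Average class probabilities from several phones and pick the top class.

    `per_phone_probs` is a list with one dict per phone, mapping class label
    to that phone's estimated probability. Plain averaging is an assumption
    made for this sketch, not necessarily the rule used in Darwin.
    """
    totals = {}
    for probs in per_phone_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    n = len(per_phone_probs)
    averaged = {label: total / n for label, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged
```

A single phone in a noisy spot can be outvoted this way: if one of three phones favours the wrong speaker while the other two favour the right one with reasonable confidence, the averaged prediction follows the majority.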

In short, classification on (clusters of) mobile phones seems to be a really interesting problem!

Completely unrelated but still interesting was a recent Economist article about Alibaba, the Chinese site that matches up producers in China with foreign buyers, eliminating middlemen. I had been vaguely aware of this site and sometimes idly wondered about its business model, and here the Economist suggests that the company is sitting on some really valuable data about how creditworthy small companies are, which companies know each other, and in general how Chinese middle-class people spend their money. It must be an interesting data set for sure.

Phylo – an alignment game

I’ve been playing some Phylo while snowed in during this weekend. This nifty game, developed by a group at McGill University in Canada, reminds me a lot of FoldIt, which I’ve mentioned several times on this blog. Like FoldIt, Phylo works well just as a logic/pattern-recognition game, but also has a hidden (well, actually not hidden at all) agenda; it tries to apply the strategies used by the (most skillful) players to actual scientific problems. In the case of Phylo, the problem that you are trying to solve is multiple sequence alignment, or described more simply, trying to match up DNA sequences from different species to each other. Multiple sequence alignment is one of the truly classic problems in bioinformatics, and there are many good algorithms for it, but these could still be improved. The idea of Phylo is to leverage human beings’ superior pattern recognition capabilities to solve really tricky multiple alignment problems. Related (or presumably related) DNA sequences from various organisms have already been matched up against each other (aligned) by an existing algorithm, and the idea is that human players may be able to further optimize the alignments “by eye”.

I think there are two things that are really cool about this game. The first thing is that the creators are actually picking the problems from a public resource, the UCSC Genome Browser, where they have located a number of poorly aligned stretches of DNA close to genes (stretches in so-called “promoter regions”). These are regions for which one might suspect that the best alignment hasn’t been found. Also, these regions are interesting from a disease perspective, and each task in Phylo has to do with a certain disease or type of disease.

The second thing that I like is the educational aspect of the game. I’ve studied alignment algorithms (a long time ago), and even though I knew about the scoring schemes on a theoretical level, I hadn’t really understood them in a tangible way before I played Phylo. It’s funny how a game with scores motivates you to understand how something works. If I were teaching a bioinformatics course, I would not hesitate to have the students play Phylo in conjunction with the material on sequence alignment. Never mind the exam, just solve level 9 and you’ve passed the course!
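
For those who haven’t met alignment scoring schemes before, here is a small sketch of one common textbook variant, the sum-of-pairs score: every pair of rows in the alignment is compared column by column, with a reward for matches and penalties for mismatches and gaps. The numeric values below are illustrative assumptions, not Phylo’s actual scoring, and real schemes often treat opening a gap differently from extending one.

```python
from itertools import combinations

# Illustrative values only; not the scores used by Phylo.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a, b):
    """Score two aligned rows of equal length, column by column."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column with gaps in both rows is ignored
        if x == "-" or y == "-":
            total += GAP
        elif x == y:
            total += MATCH
        else:
            total += MISMATCH
    return total


def sum_of_pairs(alignment):
    """Sum the pairwise scores over all pairs of rows in the alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))
```

Sliding a block of letters one step in the game amounts to moving gap characters around in one row and watching this kind of total go up or down.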

Food and health data set

I stumbled upon an amazing dataset about food and health, available online here (Google spreadsheet) and described at the Canibais e Reis blog. I found it through the Cluster analysis of what the world eats blog post, which is cool, but which doesn’t go into the health part of the dataset. By the way, the R code used in that blog post is useful for learning how to plot things onto a map of the world in R (and it calculates the most deviant food habits in Mexico and USA as a bonus). Also note the first line:

diet <- read.csv("http://spreadsheets.google.com/pub?key=tdzqfp-_ypDqUNYnJEq8sgg&single=true&gid=0&output=csv")

which reads the data set directly from a URL into an R data structure, ready to be manipulated. I think it’s pretty neat, but then I am easily impressed.

The Canibais e Reis author was interested in data on the relationship between nutrition, lifestyle and health worldwide, but those data were dispersed over various sources and used different formats. He therefore (heroically) combined information from sources like the FAO Statistical Yearbook (for world nutrition data), the British Heart Foundation (for world heart-related, diabetes, obesity, cholesterol etc. disease statistics) and the WHO Global Health Atlas and WHO Statistical Information System (for general world health statistics like mortality, sanitation, drinking water, etc.) After cleaning up the data set and removing incomplete entries, he ended up with a complete matrix of 101 nutrition, health and lifestyle variables for 86 countries. Let the mining begin!

As the blog post describing the data points out, there’s bound to be a lot of confounding variables and non-independence in the data set, so it would be a good idea to apply tools like PCA (see e.g. the recent article Principal Components for Modeling), canonical correlation analysis or something similar to it as a pre-processing step. I haven’t had time to do more than fiddle around a bit – for example, I ran a quick PCA on the food related part of the matrix to try to find out the major direction of variation in world diets. The first principal component (which, at 19.8%, is not very dominant) reflects a division between rice eating countries and “meat and wheat” countries with high consumption of animal products, wheat, meat and sugar.
Canibais e Reis provides a dynamic Excel file where some different types of analysis have been performed. It’s fun to explore the unexpected correlations (or absent correlations) that pop up (the worksheets BEST and WORST in the Excel file). One surprising finding that emerges is that cholesterol is not correlated to cardiovascular disease across this data set (in fact there is a slight negative correlation).
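
The quick PCA mentioned above can be sketched in a few lines of plain Python. This is not the code I ran on the food data; it is a minimal illustration that standardises the columns, builds the covariance matrix, and finds the first principal component by power iteration, together with the fraction of variance it explains. The function names are made up for this example, and constant columns (zero variance) are not handled.

```python
import math


def standardise(rows):
    """Centre each column and scale it to unit (sample) variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(d)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Leading eigenvector of `cov` by power iteration, plus explained fraction.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    explained = eigenvalue / sum(cov[i][i] for i in range(d))
    return v, explained
```

The `explained` value is the counterpart of the 19.8% figure quoted above: the share of the total variance that lies along the first component.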

My favourite finding, though, is that cheese consumption is not correlated to death from non-communicable diseases or cardiovascular diseases. Those correlations may be massively influenced by confounding variables, but they are negative enough that I choose to continue chomping on those cheeses …

Data services

There’s been a little hiatus here as I have been traveling. I recently learned that Microsoft has launched Codename “Dallas”, a service for purchasing and managing datasets and web services. It seems they are trying to provide consistent APIs to work with different data from the public and private sectors in a clean way. There’s an introduction here.

This type of online data repository seems to be an idea whose time has arrived – I have previously talked about resources like Infochimps, Datamob and Amazon’s Public Data Sets, and there is also theinfo.org, which I seem to have forgotten to mention. A recent commenter on this blog pointed me to the comprehensive knowledge archive network, which is a “registry of open data and content packages”. Then there are the governmental and municipal data repositories, such as data.gov.

Another interesting service, which may have a slightly different focus, is Factual, described by founder Gil Elbaz as a “platform where anyone can share and mash open data“. Factual basically wants to list facts, and puts the emphasis on data accuracy, so you can express opinions on and discuss the validity of any piece of data. Factual also claims to have “deeper data technology” which allows users to explore the data in a more sophisticated way compared to other services like the Amazon Open Data Sets, for instance.

Companies specializing in helping users make sense of massive data sets are, of course, popping up as well. I have previously written about Good Data, and now the launch of a new, seemingly similar company, Data Applied, has been announced. Like Good Data, Data Applied offers affordable licenses for cloud-based and social data analysis, with a free trial package (though Good Data’s free version seems to offer more – a 10 MB data warehouse and 1-5 users vs Data Applied’s file size of <100 kB for a single user; someone correct me if I am wrong). The visualization capabilities of Data Applied do seem very nice. It’s still unclear to me how different the offerings of these two companies are, but time will tell.

Mining data streams, the web, and the climate

I recently came across MOA (Massive Online Analysis), an environment for what its developers call massive data mining, or data stream mining. This New Zealand-based project is related to Weka, a Java-based framework for machine learning which I’ve used quite a bit over the years. Data stream mining differs from plain old data mining in that the data is assumed to arrive quickly and continuously, as in a stream, and in an unpredictable order. Therefore the full data set will typically be many times larger than your computer’s memory (which already rules out some commonly used algorithms), and each example can only be briefly examined once, after which it is discarded. Therefore the statistical model has to be updated incrementally, and often must be ready to be applied at any point between training examples.
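
To illustrate the constraints described above (each example seen once, nothing stored, model usable at any time), here is a minimal sketch of an online perceptron. It is not taken from MOA; the class and method names are made up for this example, and the perceptron is just one of the simplest models that can be updated incrementally.

```python
class OnlinePerceptron:
    """A linear classifier that is updated one example at a time."""

    def __init__(self, num_features, learning_rate=1.0):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.seen = 0  # number of examples processed so far

    def predict_one(self, x):
        """Return +1 or -1; can be called at any point in the stream."""
        activation = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1 if activation > 0 else -1

    def learn_one(self, x, y):
        """Update on one (x, y) pair with y in {-1, +1}; x is not kept."""
        self.seen += 1
        if self.predict_one(x) != y:
            self.weights = [
                w + self.learning_rate * y * xi
                for w, xi in zip(self.weights, x)
            ]
            self.bias += self.learning_rate * y


def train_on_stream(model, examples):
    """Feed an iterable of (x, y) pairs to the model, one pass, no buffering."""
    for x, y in examples:
        model.learn_one(x, y)
    return model
```

Memory use depends only on the number of features, not on the length of the stream, which is the property that makes this style of learning workable when the full data set is many times larger than RAM.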

I also came across a press release describing version 2.0 of KnowledgeMiner for Excel, a data mining software apparently used by customers like Pfizer, NASA and Boeing, and which is based on GMDH (Group Method of Data Handling), a paradigm I hadn’t heard about before. I failed to install KnowledgeMiner for Excel for my Mac due to some obscure install error, but from what I gather, the GMDH framework involves a kind of automatic model selection, making it easier to use for non-experts in data mining. (Of course I haven’t tried it, so it’s hard to evaluate the claim.) The example data set provided with the software package has to do with climate data and modeling, so it should be fun to try as soon as I get it working:

The new KnowledgeMiner is now capable of high-dimensional modeling and prediction of climate and has an included example using air and sea surface temperature data. This is a first for a data-mining software package: to offer anyone the ability to see for themselves that global temperatures are rising steadily, using publicly available data. The biggest surprise is seeing that the changes are greatest and accelerating in the northern latitudes. By using data from the past, KnowledgeMiner (yX) can show predictions for future years. Go to this link to see the climate change data displayed graphically in a slideshow through the year 2020:

There’s also an interesting new toolkit for web mining from BixoLabs. They’ve built what they call an elastic web mining platform in Amazon’s Elastic Compute Cloud (on top of Hadoop, Cascading and a web mining framework called Bixo, for those of you who care). The whole thing is pre-configured and scalable, and from the tutorials on the site, it seems pretty easy to set it up to crawl the web to your heart’s content.
