Follow the Data

A data driven blog

Online analysis contests and animal testing

I’d like to draw your attention to two online data analysis challenges that both, in their way, address drug testing on animals and how results of such testing translate to human physiology.

CAMDA 2013 (12th international conference on critical assessment of massive data analysis) is a conference that focuses on massive data sets in the life sciences. This year, it has two associated analysis challenges, one of which is “prediction of drug compatibility from an extremely large toxicogenomic data set.” The data set used in this challenge contains over 20,000 genome expression microarrays, each measuring roughly 20,000 genes in the liver of rats treated with mainly human drugs. There are two questions that the organizers want to address:

  • Question 1: Can we replace animal studies with in vitro assays? ["in vitro" literally means "in glass", for instance in a test tube]
  • Question 2: Can we predict liver injury in humans using toxicogenomics data from animals?

Meanwhile, the SBV (systems biology verification) Improver project, which ran a prediction contest last year that was covered in this blog, is starting its new Species Translation Challenge, which also aims to address how “translatable” biological events in rats or mice are to humans. This challenge, which has four sub-challenges, aims to answer the following questions:

  • Can the perturbations of signaling pathways in one species predict the response to a given stimulus in another species?
  • Which biological pathways, functions and gene expression profiles are most robustly translated?
  • Which gene expression profiles and associated biological pathways / functions are most robustly translated?
  • Does translation depend on the nature of the stimulus or data type collected such as protein phosphorylation and cytokine responses?
  • Which computational methods are most effective for inferring gene, phosphorylation and pathway response from one species to another?

I think it will be very interesting to see how these challenges play out and to compare their respective outcomes.

GitHub goodies

  • The first post from the brand new Nuts ‘n Bolts blog talks about hash kernels and how to use them to represent arbitrary input data in a format suitable for machine learning. There is a GitHub repo called hashkernel that demonstrates the approach. The tag line for the repo is great: A demonstration of how to use hash kernels for ridiculously unprincipled machine learning.
  • This IPython notebook shows how to write a (greedy, not de Bruijn) genome assembler using tools available at the Pacific Biosciences GitHub repo. Titus Brown also has a repo showing how to implement a de Bruijn graph based ASCII assembler on top of Bloom filters.
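
To give a flavour of the hash kernel idea from the first item, here is a minimal sketch of the "hashing trick". It is not taken from the hashkernel repo; the function name hash_features, the bucket count, the use of MD5 and the example record are all assumptions made for illustration. Each token is hashed to a bucket index plus a sign, so arbitrary input ends up in a fixed-length numeric vector that a learner can consume.

```python
import hashlib


def hash_features(tokens, n_buckets=16):
    """Project a bag of string tokens onto a fixed-length vector (hypothetical helper)."""
    vector = [0.0] * n_buckets
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # The first four bytes of the digest pick the bucket ...
        index = int.from_bytes(digest[:4], "big") % n_buckets
        # ... and one further byte picks a sign, so that collisions
        # tend to cancel out instead of always piling up.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vector[index] += sign
    return vector


# Made-up record, encoded as "name=value" tokens.
record = ["gene=TP53", "tissue=liver", "dose=high"]
vector = hash_features(record)
```

The vector length is fixed in advance and does not depend on how many distinct tokens show up, which is what makes the representation convenient for messy inputs. The price is that different tokens can collide in the same bucket.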

This & that

  • The BigML blog has been on a roll lately with many interesting posts. I particularly liked this one, Bedtime for Boosting, which goes pretty deep into benchmarking various versions of the boosting algorithms we all know and love (?).
  • Mark Gerstein of Yale University has a nice slide deck (PDF) about the big data blizzard in genomics. There are lots of ideas here about how to build predictive models based on, for example, ENCODE data. I won’t get into the ongoing controversy around ENCODE here; suffice it to say that I think the ENCODE data sets are a good resource for starting to build statistical models of genomic regulation on a larger scale.
  • The O’Reilly Radar has a good post about how Python data tools just keep getting better.
  • An “ultra-tricky” bioinformatics challenge will be run by Genome Biology on DNA Day (April 25), with a “truly awesome” prize. Intriguing.

Topology and data analysis: Gunnar Carlsson and Ayasdi

A few months ago, I read in Wired [Data-Visualization Firm’s New Software Autonomously Finds Abstract Connections] and the Guardian [New big data firm to pioneer topological data analysis] about Ayasdi, the new data visualization & analytics company founded by professor Gunnar Carlsson at Stanford that has received millions in funding from Khosla Ventures, DARPA and other places. Today, I had the opportunity to hear Carlsson speak at the Royal Institute of Technology (KTH) in Stockholm about the mathematics underlying Ayasdi’s tools. I was very eager to hear how topology (Carlsson’s specialty) connects to data visualization, and about their reported success in classifying tumor samples from patients [Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival].

The talk was very nice. Actually there were two talks – one for a more “general audience” (which, in truth, probably consisted mostly of hardcore maths or data geeks) and one that went much deeper into the mathematics – and completely over my head.

One thing that intrigued me was that almost all of his examples were taken from biology: disease progression, cell cycle gene expression, RNA hairpin folding, copy number variations, mapping genotypes to geographic regions … the list goes on. I suppose it’s simply because Carlsson happens to work a lot with biologists at Stanford, but could it be that biology is an especially fertile area for innovation in data analysis? As Carlsson highlights in a review paper [Topology and Data], data sets from high-throughput biology often have many dimensions that are superfluous or have unknown significance, and there is no good understanding of which distance measures between data points should be used. That is one reason to consider methods that do not break down when the scale is changed – such as topological methods, which are insensitive to the actual choice of metrics.

(Incidentally, I think there is a potential blog post waiting to be written about how many famous data scientists have come out of a biology/bioinformatics background. Names like Pete Skomoroch, Toby Segaran and Michael Driscoll come to mind, and those were just the ones I thought of instantly.)

Another nice aspect of the talk was that it was no sales pitch for Ayasdi (the company was hardly even mentioned) but more of a bird’s eye view of topology and its relation to clustering and visualization. In my (over)simplified understanding, the methods presented represent the data as a network where the nodes, which are supposed to represent “connected components” in an ideal scenario, are clusters derived using, e.g., hierarchical clustering. However, there is no cutoff value defined for breaking the data into clusters, but instead the whole outcome of the clustering – its profile, so to speak – is encoded and used in the following analysis. Carlsson mentioned that one of the points of this sort of network representation of data was to “avoid breaking things apart”, as clustering algorithms do. He talked about classifying the data point clouds using “barcodes”, ensuring persistence across changes of scale. The details of how these barcodes were calculated were beyond my comprehension.
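
For the very simplest case, where one only tracks connected components, the barcode idea can be sketched in a few lines. This is a toy illustration and not how Ayasdi’s tools work or how barcodes are computed in general. The function name h0_barcode and the example point cloud are made up, and the scale at which two components merge is taken to be the plain Euclidean distance between their closest points (other conventions use half that distance). Every point starts its own bar at scale 0, and a bar ends when its component merges into another one as the scale grows.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Zero-dimensional persistence bars (birth, death) for a small point cloud."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, shortest first.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge at this scale, so one bar ends here.
            parent[root_i] = root_j
            bars.append((0.0, length))
    # The last surviving component never dies.
    bars.append((0.0, inf))
    return bars


# Made-up cloud: a tight group of three points and a distant pair.
cloud = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
bars = h0_barcode(cloud)
```

In this toy cloud, three bars end at scale 1, one ends at scale 9 and one never ends. The two long bars correspond to the two groups of points, and that reading requires no cutoff value to be chosen beforehand.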

Carlsson showed some examples of how visualizations created using his methods improved on hierarchical clustering dendrograms or PCA/MDS plots. He said that one of the advantages of the method is that it can identify small features in a large data set. These features would, he said, be “washed out” in PCA and not be picked up by standard clustering methods.

I look forward to learning more about topological data analysis. There are some (links to) papers at Ayasdi’s web site if you are interested, e.g.: Extracting insights from the shape of complex data using topology, Topology and Data (already mentioned above), Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition.

RNA-seq analysis slides from data integration workshop

In case there are any genomics people visiting this blog, here are PDF slides for a presentation I gave in February 2013 at the High Throughput Omics Data Integration Workshop in Barcelona. It was a 90-minute presentation so there are 85 (!) slides: HighThroughputOmics_DataIntegration_Workshop_Barcelona_Feb2013_MikaelHuss

Online course experiences: Coursera Data Analysis and Syracuse U. Data Science

I’ve been following two online data analysis related courses during the past few months: the Data Analysis course given by Johns Hopkins U. through Coursera and the Introduction to Data Science course given by Syracuse U. through Coursesites.

The Data Analysis course is the third one that I have enrolled in on Coursera, and the first one where I have completed all the coursework (I received my statement of accomplishment this past weekend, yippee!). Of the two previous courses I had enrolled in, I had tried to follow one but given up because of problems with the platform incorrectly grading the quizzes – a childish thing to quit a course over, because it’s the things you learn that should matter, but I felt that the weird grading made me uncertain about what parts of the material I had really understood.

I think the Data Analysis course was quite good, because it focused not only on R and statistics (which is great) but also on more practical aspects of data analysis, like how you might organize your files and write up a good analysis report. It introduced me to things like R markdown and knitr, which I had heard about but not used until now. The course contents were also surprisingly up to date, with things like the medley R package being included in the video lectures. This package, which was developed by a Kaggle competitor for constructing ensemble models more easily, was first mentioned in January 2013 on a Kaggle forum and doesn’t yet exist as an R package, yet it was covered in the course with nice examples of how to run it!

There is a “post-mortem” podcast at Simply Statistics where Jeff Leek (the main instructor of the course) and Roger Peng discuss what went right and what went wrong.

The course videos are on YouTube and course lectures are available on GitHub; both videos and lectures are tagged by week. Some numbers on participation given by Jeff Leek:

There were approximately 102,000 students enrolled in the course, about 51,000 watched videos, 20,000 did quizzes, and 5,500 did/graded the data analysis assignments.

Personally, I would perhaps have liked the contents to be slightly more difficult (because I came in with a fair amount of subject knowledge) but on the other hand the given level of difficulty let me get away with spending 3-5h per week on average on the course, as advertised. I think many students used a lot more.

The other course that I participated in, Introduction to Data Science from Syracuse University, was similar to the Data Analysis course in that it used R as the vehicle for introducing statistical concepts. However, this course was much more limited in scope and basically did not assume that the students had had any prior exposure to statistics or programming. I felt that this was a mismatch for me and in the end did not finish all of the coursework. I did read the accompanying textbook which, in parts, did a very good job of explaining the value of data analysis in real-world scenarios. I felt that the course would be most useful for people who are curious about “big data” and “data science” and want to dip their toes into it a little bit but not necessarily work with data analysis. Maybe this was the intention.

Foolhardy as I am, I plan to take another MOOC data science course beginning in May, namely Introduction to Data Science. I’ll report back here afterwards!

 

“The secret of the big guys” (from FastML)

FastML has an intriguing post called The secret of the big guys, which starts thus:

Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods, you can get better results than from a random forest. And maybe even faster.

It definitely sounds worth trying, and there are links to some papers with Andrew Ng as a co-author in the blog post. I haven’t had time to try out Sofia-ML yet, or to implement this approach by hand, but I’ll definitely give it a shot at some distant point in time when I don’t have my hands full.
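
As a placeholder until then, here is a toy sketch of one way the two methods could be combined. It is my own assumption about the general recipe, not necessarily what the FastML post or Sofia-ML does: K-means learns a few centroids, each sample is re-encoded as a one-hot vector marking its nearest centroid, and a linear model (here a plain perceptron) is trained on that encoding. All names and the synthetic XOR-style data are made up.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, n_iter=10):
    """Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def encode(point, centroids):
    """One-hot vector marking the nearest centroid."""
    code = [0.0] * len(centroids)
    code[nearest(point, centroids)] = 1.0
    return code


def score(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_perceptron(features, labels, n_epochs=20):
    """Plain perceptron; labels are +1 or -1."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(n_epochs):
        for x, y in zip(features, labels):
            if y * score(x, w, b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


# Synthetic XOR-style data: four blobs, opposite corners share a class.
rng = random.Random(0)
blobs = [((0, 0), -1), ((1, 1), -1), ((0, 1), 1), ((1, 0), 1)]
points, labels = [], []
for (cx, cy), label in blobs:
    for _ in range(25):
        points.append((cx + rng.uniform(-0.1, 0.1), cy + rng.uniform(-0.1, 0.1)))
        labels.append(label)

centroids = kmeans(points, k=4)
features = [encode(p, centroids) for p in points]
w, b = train_perceptron(features, labels)
predictions = [1 if score(x, w, b) > 0 else -1 for x in features]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

No straight line separates the two classes in the raw coordinates, but once each point is described by its nearest centroid the perceptron has an easy job. Real applications would use many more centroids and probably a softer encoding than one-hot.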

Cost-sensitive learning

I have been looking at a binary classification problem (basically a problem where you are supposed to predict “yes” or “no” based on a set of features) where misclassifying a “yes” as a “no” is much more costly than misclassifying a “no” as a “yes”.

Searching the web for hints about how to approach this kind of scenario, I discovered that there are some methods explicitly designed for this, such as MetaCost [pdf link] by Domingos and cost-sensitive decision trees, but I also learned that there are a couple of very general relationships that apply for a scenario like mine.

In this paper (pdf link) by Ling and Sheng, it is shown that if your (binary) classifier can produce a posterior probability estimate for predicting (e.g.) “yes” in a test set, then one can make that classifier cost-sensitive simply by choosing the classification threshold (which is often taken as 0.5 in non-cost-sensitive classifiers) according to p_threshold = FP / (FP + FN), where FP is the cost of a false positive and FN is the cost of a false negative (correct predictions are assumed to cost nothing). Equivalently, one can “rebalance” the original samples by sampling “yes” and “no” examples proportionally so that p_threshold becomes 0.5. That is, the prior probabilities of the “yes” and “no” classes and the costs are interchangeable.

So in principle, one could either manipulate the classification threshold or the training set proportions to get a cost-sensitive classifier. Of course, further adjustment may be needed if the classifier you are using does not produce explicit probability estimates.
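
The threshold rule is easy to sketch. The snippet below assumes that correct predictions cost nothing and that the classifier’s probability estimates are reasonably calibrated; the function names and the cost values are made up for illustration. Predicting “yes” has an expected cost of (1 - p) * FP and predicting “no” has an expected cost of p * FN, so “yes” is the cheaper choice whenever p >= FP / (FP + FN).

```python
def cost_sensitive_threshold(fp_cost, fn_cost):
    """Probability of "yes" above which predicting "yes" has the lower expected cost."""
    if fp_cost <= 0 or fn_cost <= 0:
        raise ValueError("costs must be positive")
    return fp_cost / (fp_cost + fn_cost)


def classify(p_yes, fp_cost, fn_cost):
    """Turn an estimated probability of "yes" into a cost-sensitive label."""
    return "yes" if p_yes >= cost_sensitive_threshold(fp_cost, fn_cost) else "no"


# Hypothetical costs: missing a "yes" is nine times worse than a false alarm.
FP_COST, FN_COST = 1.0, 9.0
threshold = cost_sensitive_threshold(FP_COST, FN_COST)  # 1 / (1 + 9) = 0.1
labels = [classify(p, FP_COST, FN_COST) for p in (0.05, 0.2, 0.6)]
```

With these costs the threshold drops from 0.5 to 0.1, so a sample with an estimated probability of only 0.2 is already labelled “yes”.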

The paper is worth reading in full as it shows clearly why these relationships hold and how they are used in various cost-sensitive classification methods.

P.S. This is of course very much related to the imbalanced class problem that I wrote about in an earlier post, but at that time I was not thinking that much about the classification-cost aspect yet.

Industrial postdoc position in genomics & big data (Stockholm)

There is an interesting postdoc position available at AstraZeneca in Mölndal, but located at SciLifeLab in Stockholm. This (bioinformatics) position is about next generation sequencing and data integration, with a definite “big data” slant from what I have heard. Check it out!

 

Quick links

  • Data Dealer looks like it’s going to be a blast. Dust off your high-school German and watch the great video trailer. I like the slogan “Legal, illegal, scheißegal!”
  • Ayasdi is a startup working on “topological data analysis” and has a visual exploration tool that looks neat. As it happens, I saw a demo today from another “visual analytics company.” It may not be a bad idea to make use of the human visual system’s amazing powers, but the problem is to keep statistical significance close at hand.
