Follow the Data

A data driven blog

Archive for the tag “machine-learning”

Genomics Today and Tomorrow presentation

Below is a Slideshare link/widget to a presentation I gave at the Genomics Today and Tomorrow event in Uppsala a couple of weeks ago (March 19, 2015).

I spoke after Jonathan Bingham of Google Genomics and talked a little bit about how APIs, machine learning, and what I call “querying by dataset” could make life easier for bioinformaticians working on data integration. In particular, I gave examples of a few types of queries that one would like to be able to do against “all public data” (slides 19-24).

Not long after, I saw this preprint (called “Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees”) that seems to provide part of the functionality that I was envisioning – in particular, the ability to query public sequence repositories by content (using a sequence as a query), rather than by annotation (metadata). The beginning of the abstract goes like this:

Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. However, at present this is a computationally demanding question at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments.

Lessons learned from mining heterogeneous cancer data sets

How much can we learn about cancer treatment and prevention by large-scale data collection and analysis?

An interesting paper was just published: “Assessing the clinical utility of cancer genomic and proteomic data across tumor types“. I am afraid the article is behind a paywall, but no worries – I will summarize the main points here! Basically the authors have done a large-scale data mining study of data published within The Cancer Genome Atlas (TCGA) project, a very ambitious effort to collect molecular data on different kinds of tumors. The main question they ask is how much clinical utility these molecular data can add to conventional clinical information such as tumor stage, tumor grade, age and gender.

The lessons I drew from the paper are:

  • The molecular data does not add that much predictive information beyond the clinical information. As the authors put it in the discussion, “This echoes the observation that the number of cancer prognostic molecular markers in clinical use is pitifully small, despite decades of protracted and tremendous efforts.” It is an unfortunate fact of life in cancer genomics that many molecular classifiers (based on gene expression patterns usually) have been proposed to predict tumor severity, patient survival and so on, but different groups keep coming up with different gene sets and they tend not to be validated in independent cohorts.
  • When looking at what factors explain most of the variation, the type of tumor explains the most (37.4%), followed by the type of data used (that is, gene expression, protein expression, micro-RNA expression, DNA methylation or copy number variations) which explains 17.4%, with the interaction between tumor type and data type in third place (11.8%), suggesting that some data types are more informative for certain tumors than others. The algorithm used is fairly unimportant (5.2%). At the risk of drawing unwarranted conclusions, it is tempting to generalize this into something like this: the most important factor is the intrinsic difficulty of modeling the system, the next most important factor is the decision of what data to collect and/or feature engineering, while the type of algorithm used for learning the model comes far behind.
  • Perhaps surprisingly, there was essentially no cross-tumor predictive power between models. (There was one exception to this.) That is, a model built for one type of tumor was typically useless when predicting the prognosis for another tumor type.
  • Individual molecular features (expression levels of individual genes, for instance) did not add predictive power beyond what was already in the clinical information, but in some cases molecular subtype did. The molecular subtype is a “molecular footprint” derived using consensus NMF (nonnegative matrix factorization, an unsupervised method that can be used for dimension reduction and clustering). This footprint that described a general pattern was informative whereas the individual features making up the footprint weren’t. This seems consistent with the issue mentioned above about gene sets failing to consistently predict tumor severity. The predictive information is on a higher level than the individual genes.

The authors argue that one reason for the failure of predictive modeling in cancer research has been that investigators have relied too much on p values to say something about the clinical utility of their markers, when they should instead have focused more on the effect size, or the magnitude of difference in patient outcomes.

They also make a good point about reliability and reproducibility. I quote: “The literature of tumor biomarkers is plagued by publication bias and selective and/or incomplete reporting“. To help combat these biases, the authors (many of whom are associated with Sage Biosystems, who I have mentioned repeatedly on this blog) have made available an open model-assessment platform, including of course all the models from the paper itself, but which can also be used to assess your own favorite model.

Cancer, machine learning and data integration

Machine Learning Methods in the Computational Biology of Cancer is an arXiv preprint of a pretty nice article dealing with some analysis that can be used for high-dimensional biological (and other) data – although the examples come from cancer research, they could easily be about something else. This paper does a good job of describing penalized regression methods such as lasso, ridge regression and elastic net. It also goes into compressed sensing and its applicability to biology, although cautioning that it cannot yet be straightforwardly applied to biological data. This is because compressed sensing is based on the assumption that one can choose the “measurement matrix” freely, whereas in biology, it (usually called “design matrix” in this context) is already fixed.

The Critical Assessment of Massive Data Analysis (CAMDA) 2014 conference has released its data analysis challenges. Last year’s challenges on toxicogenomics and toxicity prediction will be reprised (perhaps in modified form, I didn’t check), but they have added a new challenge which I find interesting because it focuses on data integration (combining distinct data sets on gene, protein and micro-RNA expression as well as gene structural variations and DNA methylation) and uses published data from the International Cancer Genome Consortium (ICGC). I think it’s a good thing to re-analyze, mash up and meta-analyze data from these large-scale projects, and the CAMDA challenges are interesting because they are so open-ended, in contrast to e g Kaggle challenges (which I also like but in a different way). The goals in the CAMDA challenges are quite open to interpretation (and also ambitious), for instance:

  • Question 1: What are disease causal changes? Can the integration of comprehensive multi-track -omics data give a clear answer?
  • Question 2: Can personalized medicine and rational drug treatment plans be derived from the data? And how can we validate them down the road?

Machine learning goings-on in Stockholm

The predictive analytics scene in Stockholm hasn’t been very vibrant, but at least we now have the Machine Learning Stockholm meetup group, which had its inaugural session hosted at Spotify on February 25 under the heading “Graph-parallel machine learning”. There was a short introduction to graph-centric computing paradigms and hands-on demos of GraphLab and Giraph.

The Stockholm R useR group has hosted good analytics-themed meetups from time to time. This past Saturday (March 8) On Saturday (March 29), they organized will organize a hackathon with two tracks: one predictive modelling contest about predicting flat prices (always a timely theme in this town) and one “R for beginners” track.

Finally, the Stockholm-based Watty are looking for a head of machine learning development (or maybe several machine learning engineers; see ad) to lead a team that will diagnose the energy use of buildings and work to minimize energy waste.

A quotable Domingos paper

I’ve been (re-)reading Pedro Domingos’ paper, A Few Useful Things to Know About Machine Learning, and wanted to share some quotes that I like.

  • (…) much of the “folk knowledge” that is needed to successfully develop machine learning applications is not readily available in [textbooks].
  • Most textbooks are organized by representation [rather than the type of evaluation or optimization] and it’s easy to overlook the fact that the other components are equally important.
  • (…) if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it.
  • Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.
  • What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (…)
  • (…) strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.
  • Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter cover only a fraction of about 10^-18 of the input space. This is what makes machine learning both necessary and hard.
  • (…) the most useful learners are those that facilitate incorporating knowledge.

Another interesting recent paper by Domingos is What’s Missing in AI: The Interface Layer.

Previously, Domingos has done a lot of interesting work on, for instance, why Naïve Bayes often works well even though its assumptions are not fulfilled, and why bagging works well. Those are just the ones I remember, I’m sure there is a lot more.

“The secret of the big guys” (from FastML)

FastML has an intriguing post called The secret of the big guys, which starts thus:

Are you interested in linear models, or K-means clustering? Probably not much. These are very basic techniques with fancier alternatives. But here’s the bomb: when you combine those two methods, you can get better results than from a random forest. And maybe even faster.

It definitely sounds worth trying, and there are links to some papers with Andrew Ng as a co-author in the blog post. I haven’t had time to try out Sofia-ML yet, or to implement this approach by hand, but I’ll definitely give it a shot at some distant point in time when I don’t have my hands full.

Cost-sensitive learning

I have been looking at a binary classification problem (basically a problem where you are supposed to predict “yes” or “no” based on a set of features) where the cost of misclassifying a “yes” as a “no” is much more expensive than misclassifying a “no” as a “yes”.

Searching the web for hints about how to approach this kind of scenario, I discovered that there are some methods explicitly designed for this, such as MetaCost [pdf link] by Domingos and cost-sensitive decision trees, but I also learned that there are a couple of very general relationships that apply for a scenario like mine.

In this paper (pdf link) by Ling and Sheng, it is shown that if your (binary) classifier can produce a posterior probability estimate for predicting (e g) “yes” in a test set, then one can make that classifier cost-sensitive simply by choosing the classification threshold (which is often taken as 0.5 in non-cost-sensitive classifiers) according to p_threshold = FP / (FP + FN), where FP is the false positive rate and FN is the false negative rate. Equivalently, one can “rebalance” the original samples by sampling “yes” and “no” examples proportionally so that p_threshold becomes 0.5. That is, the prior probabilities of the “yes” and “no” classes and the costs are interchangeable.

So in principle, one could either manipulate the classification threshold or the training set proportions to get a cost-sensitive classifier. Of course, further adjustment may be needed if the classifier you are using does not produce explicit probability estimates.

The paper is worth reading in full as it shows clearly why these relationships hold and how they are used in various cost-sensitive classification methods.

P.S. This is of course very much related to the imbalanced class problem that I wrote about in an earlier post, but at that time I was not thinking that much about the classification-cost aspect yet.

Large-scale machine learning course & streaming organism classification

The NYU Large Scale Machine Learning course looks like it will be very worthwhile to follow. The instructors, John Langford and Yann Le Cun, are both key figures in the machine learning field – for instance, the former developed Vowpal Wabbit and the latter has done pioneering work in deep learning. It is not an online course like those at Coursera et al., but they have promised to put lecture videos and slides online. I’ll certainly try to follow along as best I can.


There is an interesting Innocentive challenge going on, called “Identify Organisms from A Stream of DNA Sequences.” This is interesting to me both because of the subject matter (classification based on DNA sequences) and also because the winner is explicitly required to submit an efficient, scalable solution (not just a good classifier.) Also, the prize sum is one million US dollars! It’s exactly this kind of algorithms that will be needed to enable the “genomic observatories” that I have mentioned before on this blog which will continuously stream out sequences obtained from the environment.

Explaining predictions and imbalanced data sets

In many applications it is of interest to know why a specific prediction was made for an individual instance. For instance, a credit card company might want to be able to explain why a customer’s card was blocked even though there was no actual fraud. Obviously, this kind of “reverse engineering” of predictions is very important for improving the predictive model, but many (most?) machine learning methods don’t have a straightforward way of explaining why a certain specific prediction was made, as opposed to displaying a “decision surface” or similar that describes in general what matters for the classification. There are exceptions, like decision trees, where the resulting classifier shows in a transparent way (using binary tests on the features) what the prediction will be for a specific example, and naïve Bayes classifiers, where it is straightforward to quantify how much each feature value contributes to the prediction.

Someone asked a question about this on MetaOptimize QA and the answers contained some interesting pointers to published work about this interesting problem. Alexandre Passos recommends two papers, How To Explain Individual Classification Decisions (pdf link) by Baehrens et al., and An Efficient Explanation of Individual Classifications
using Game Theory (pdf link) by Strumbelj and Kononenko. The Baehrens et al. paper defines something called an “explanation vector” for a data point, which is (if I have understood the paper correctly) a vector that has the same dimension as the data point itself (the number of features) and that points towards the direction of maximum “probability flow” away from the class in question. The entries in this vector that have large absolute values correspond to features that have a large local influence on which class is predicted. The problem is that this vector typically cannot be calculated directly from the classification model (except in some cases like Gaussian Processes), so it has to be estimated using some sort of smoothing of the data; in the paper they use Parzen windows.

I would really like to have code to do this (ideally, an R package) but couldn’t find any in the paper.

The Strumbelj paper uses a completely different approach which I frankly can’t really wrap my head around, but is based on game theory, specifically the idea that an explanation of a classifier’s prediction can be treated as something called a “coalitional form game” where the instance’s feature values form a “coalition” which causes a change in the classifier’s prediction. This lets them use something called the “Shapley value” to assess the contributions of each feature.

Again, it would be really nice to have the code for this, even though the authors state that “the explanation method is a straightforward Java implementation of the equations presented in this paper.”

On another “practical machine learning tips” note, the newish and very good blog linked to a very interesting article, Learning from Imbalanced Data. In working with genome-scale biological data, we often encounter imbalanced data scenarios where the positive class may contain, say, 1,000 instances, and the negative class 1,000,000 instances. I knew from experience that it is not straightforward to build and assess the performance of classifiers for this type of data sets (for example, concepts like accuracy and ROC-AUC become highly problematic), but like Carl at I was surprised by the sheer variety of approaches outlined in this review. For instance, there are very nice expositions of the dangers of under- and oversampling and how to perform more informed versions of those. Also, I had realized that cost-sensitive evaluation methods could be useful (it may be much more important to classify instances in the rare class correctly, for example) but before reading this review I hadn’t thought about how to integrate cost-sensitivity into the actual learning method.

Practical advice for machine learning: bias, variance and what to do next

The online machine learning course given by Andrew Ng in 2011 (available here among many other places, including YouTube) is highly recommended in its entirety, but I just wanted to highlight a specific part of it, namely the “Practical advice part”, which touches on things that are not always included in machine learning and data mining courses, like “Deciding what do to do next” (the title of this lecture) or “debugging a learning algorithm” (the title of the first slide in that talk).

His advice here focuses on the concepts of the bias and variance  in statistical learning. I had been vaguely aware of the concepts of “bias and variance tradeoff” and “bias/variance decomposition” for a long time, but I had always viewed those as theoretical concepts that were mostly helpful for thinking about the properties of learning algorithms; I hadn’t thought that much about connecting them to the concrete tasks of model development.

As Andrew Ng explains, bias relates to the ability of your model function to approximate the data, and so high bias is related to under-fitting. For example, a linear regression model would have high bias when trying to model a quadratic relationship – no matter how you set the parameters, you can’t get a good training set error.

Variance on the other hand is about the stability of your model in response to new training examples. An algorithm like K-nearest neighbours (K-NN) has low bias (because it doesn’t really assume anything special about the distribution of the data points) but high variance, because it can easily change its prediction in response to the composition of the training set. K-NN can fit the training data very well if K is chosen small enough (in the extreme case with K=1 the fit will be perfect) but may not generalize well to new examples. So in short, high variance is related to over-fitting.

There is usually a tradeoff between bias and variance, and many learning algorithms have a built-in way to control this tradeoff, like for instance a regularization parameter that penalizes complex models in many linear modelling type approaches, or indeed the K value in K-NN. A lot more about the bias-variance tradeoff can be found in this Andrew Ng lecture.

Now, based on these concepts, Ng goes on to suggest some ways to modify your model when you discover it has a high error on a test set. Specifically, when should you:

– Get more training examples?

(Answer: When you have high variance. More training examples will not fix a high bias, because your underlying model will still not be able to approximate the correct function.)

– Try smaller sets of features?

(Answer: When you have higher variance. Ng says, if you think you have high bias, “for goodness’ sake don’t waste your time by trying to carefully select the best features”)

– Try to obtain new features?

(Answer: Usually works well when you suffer from high bias.)

Now you might wonder how you know that you have either high bias or high variance. This is where you can try to plot learning curves for your problem. You plot the error on the training set and on the cross-validation set as functions of the number of training examples for some set of training set sizes. (This of course requires you to randomly select examples from your training set, train models on them and assess the performance for each subset.)

In the typical high bias case, the cross-validation error will initially go down and then plateau as the number of training examples grow. (With high bias, more data doesn’t help beyond a certain point.) The training error will initially go up and then plateau at approximately the level of the cross-validation error (usually a fairly high level of error). So if you have similar cross-validation and training errors for a range of training set sizes, you may have a high-bias model and should look into generating new features or changing the model structure in some other way.

In the typical high variance case, the training error will increase somewhat with the number of training examples, but usually to a lower level than in the high-bias case. (The classifier is now more flexible and can fit the training data more easily, but will still suffer somewhat from having to adapt to many data points.) The cross-validation error will again start high and decrease with the number of training examples to a lower but still fairly high level. So the crucial diagnostic for the high variance case, says Ng, is that the difference between the cross-validation error and the training set error is high. In this case, you may want to try to obtain more data, or if that isn’t possible, decrease the number of features.

To summarize (using pictures from this PDF):

– Learning curves can tell you whether you appear to suffer from high bias or high variance.

– You can base your next step on what you found using the learning curves:

I think it’s nice to have this kind of rules of thumb when you get stuck, and I hope to follow up this post pretty soon with another one that deals with a relatively recent paper which suggests some neat ways to investigate a classification problem using sets of classfication models.

Post Navigation