Lessons from medical predictive analytics competitions
Presentations from the organizers and best performing teams of the Systems Biology Improver Challenge, which I have covered before, have been put online. There is a ton of interesting stuff here for people like me who are interested in prediction contests, machine learning, and/or medical bioinformatics.
To rewind a little bit, the SBV Improver Challenge was a set of competitions somewhat similar to Kaggle’s (indeed, there is a presentation by Will Cukierski of Kaggle among these presentations): participants had to build a predictive model for classifying diseased versus healthy samples for four diseases based on gene expression data (simplifying a little bit), and then predict the disease status of an unlabeled test set. The competitions differed from Kaggle competitions in that the training data sets were not fixed – the organizers just pointed to some suggested public data sets, which I think was a nice way to do it.
The point of these sub-competitions was to establish whether gene expression profiles are truly predictive of disease states in general or just in specific cases.
Anyway – I enjoyed browsing these presentations a lot. Some take-home messages:
- Different diseases vary immensely in how much predictive signal the gene expression data contains. Psoriasis was much easier to classify correctly from these data than the other diseases: multiple sclerosis, COPD and lung cancer. The nature of the disease was an incomparably more important variable than anything related to normalization, pre-processing, choice of algorithm and so on.
- It was suggested that providing the whole test set at once may be a bad idea, because it can reveal information (such as cluster structure) that would not be available in a real-life scenario (for instance, if you went to the doctor to get a gene expression measurement from your own tissue). It was suggested that next time, the platform should provide just one data point at a time to be predicted. Indeed, I have thought a lot about this “single-sample classification” problem in connection with gene expression data lately. As anyone who has worked with (microarray or RNA-seq) gene expression data knows, there are severe experimental batch effects in these data. They are usually removed by normalizing all data points together, but that cannot always be done in practice.
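To make the batch-effect point concrete, here is a minimal quantile-normalization sketch in NumPy (the data and setup are my own toy illustration, not anything from the challenge). Joint normalization forces every sample onto the same value distribution, which removes the batch shift – but it requires having all samples in hand at once, which is exactly what a single-sample setting rules out.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize an expression matrix X (samples x genes):
    each sample's values are replaced by the mean sorted profile,
    so every sample ends up with an identical distribution."""
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)  # per-sample ranks
    mean_profile = np.sort(X, axis=1).mean(axis=0)     # average sorted values
    return mean_profile[ranks]

rng = np.random.default_rng(0)
batch_a = rng.normal(5.0, 1.0, size=(4, 6))
batch_b = rng.normal(8.0, 1.0, size=(4, 6))  # same "biology", shifted batch
X = np.vstack([batch_a, batch_b])
Xn = quantile_normalize(X)

# After joint normalization, all samples share one value distribution,
# so the batch shift is gone -- but a new single sample arriving later
# cannot be normalized this way without re-running on the whole set.
assert np.allclose(np.sort(Xn, axis=1), np.sort(Xn, axis=1)[0])
```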
- In these competitions, none of the alleged superiority of random forest classifiers or GLMs (cf. http://strataconf.com/strata2012/public/schedule/detail/22658) was in evidence, with “simple methods” such as linear discriminant analysis and Mann-Whitney tests for feature selection performing the best. Yes, I know there is probably no statistical significance here due to the small sample size …
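As an illustration of how simple such a pipeline can be, here is a sketch (my own toy example on simulated data, not any team’s actual code) of Mann-Whitney feature selection followed by plain linear discriminant analysis, using scipy and scikit-learn:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, p, k = 80, 500, 20               # samples, genes, genes to keep
y = np.repeat([0, 1], n // 2)       # healthy vs. diseased labels
X = rng.normal(size=(n, p))
X[y == 1, :10] += 1.5               # only the first 10 genes carry signal

# Rank every gene by a Mann-Whitney U test between the two classes...
pvals = np.array([mannwhitneyu(X[y == 0, j], X[y == 1, j],
                               alternative="two-sided").pvalue
                  for j in range(p)])
selected = np.argsort(pvals)[:k]    # ...and keep the k most differential genes

# ...then fit linear discriminant analysis on the selected genes.
# (In real use, the selection step must sit inside cross-validation,
# otherwise the accuracy estimate is optimistically biased.)
clf = LinearDiscriminantAnalysis().fit(X[:, selected], y)
print(clf.score(X[:, selected], y))  # training accuracy
```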
- Speaking of random forests and GLMs, I was surprised to learn about the RGLM (Random Generalized Linear Model), which is kind of a mix of the two; generalized linear models are built on random subsets of the training examples and the features (like in random forest classifiers) and predictions of many of these models are aggregated to get a final prediction. The presentation is here.
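The RGLM idea is easy to sketch in code. The following is my own simplified illustration, not necessarily the exact published procedure: each logistic regression (a GLM with a logit link) is fit on a bootstrap sample of the examples and a random subset of the features, and the member models’ predicted probabilities are averaged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rglm_fit(X, y, n_models=50, feature_frac=0.5, seed=0):
    """Toy RGLM-style ensemble: each GLM sees a bootstrap sample of
    rows and a random subset of columns, like a random forest does."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = max(1, int(feature_frac * p))
    models = []
    for _ in range(n_models):
        rows = rng.choice(n, size=n, replace=True)    # bootstrap examples
        cols = rng.choice(p, size=k, replace=False)   # random feature subset
        glm = LogisticRegression().fit(X[rows][:, cols], y[rows])
        models.append((glm, cols))
    return models

def rglm_predict(models, X):
    # Average the predicted class-1 probabilities across all member GLMs.
    probs = np.mean([glm.predict_proba(X[:, cols])[:, 1]
                     for glm, cols in models], axis=0)
    return (probs > 0.5).astype(int)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)
models = rglm_fit(X, y)
print((rglm_predict(models, X) == y).mean())  # training accuracy
```

Averaging probabilities rather than majority-voting hard labels keeps the aggregation smooth; either works for this kind of ensemble.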
The “lessons learned” presentation discusses (most of) these points and more, and is interesting throughout.