Follow the Data

A data driven blog

Archive for the tag “challenge”

Cancer, machine learning and data integration

Machine Learning Methods in the Computational Biology of Cancer is an arXiv preprint of a pretty nice article dealing with analysis methods that can be used for high-dimensional biological (and other) data. Although the examples come from cancer research, they could easily be about something else. The paper does a good job of describing penalized regression methods such as lasso, ridge regression and elastic net. It also goes into compressed sensing and its applicability to biology, while cautioning that it cannot yet be straightforwardly applied to biological data. This is because compressed sensing assumes that one can choose the “measurement matrix” freely, whereas in biology that matrix (usually called the “design matrix” in this context) is already fixed.
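
To make the difference between the three penalties a bit more concrete, here is a minimal sketch of elastic net fitted by coordinate descent, with lasso and ridge as its two extremes. This is not code from the paper: the function names, the toy data and the parameter names `alpha` and `l1_ratio` (borrowed from a common convention) are all assumptions made for illustration.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; the source of lasso's exact zeros."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2 by cyclic coordinate descent.

    l1_ratio=1.0 gives lasso, l1_ratio=0.0 gives ridge (no intercept here).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero feature, nothing to fit
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(col[i] * (residual[i] + col[i] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                w[j] = new_w
    return w


def make_toy_data(n=40, p=10, seed=0):
    """Hypothetical data: only the first two of p features carry signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    true_w = [3.0, -2.0] + [0.0] * (p - 2)
    y = [sum(x * w for x, w in zip(row, true_w)) + rng.gauss(0, 0.1) for row in X]
    return X, y, true_w


X, y, true_w = make_toy_data()
lasso_w = elastic_net(X, y, alpha=0.3, l1_ratio=1.0)
ridge_w = elastic_net(X, y, alpha=0.3, l1_ratio=0.0)
enet_w = elastic_net(X, y, alpha=0.3, l1_ratio=0.5)

print("lasso:", [round(w, 2) for w in lasso_w])
print("ridge:", [round(w, 2) for w in ridge_w])
print("enet: ", [round(w, 2) for w in enet_w])
```

On data like this, lasso tends to set the coefficients of uninformative features to exactly zero, ridge only shrinks them, and elastic net sits in between. Note that `X` here is simply given, just like a design matrix in biology; compressed sensing would additionally want to choose it.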

The Critical Assessment of Massive Data Analysis (CAMDA) 2014 conference has released its data analysis challenges. Last year’s challenges on toxicogenomics and toxicity prediction will be reprised (perhaps in modified form, I didn’t check), but they have added a new challenge which I find interesting for two reasons: it focuses on data integration (combining distinct data sets on gene, protein and micro-RNA expression as well as gene structural variations and DNA methylation), and it uses published data from the International Cancer Genome Consortium (ICGC). I think it’s a good thing to re-analyze, mash up and meta-analyze data from these large-scale projects. The CAMDA challenges are also interesting because they are so open-ended, in contrast to e.g. Kaggle challenges (which I also like, but in a different way). The goals in the CAMDA challenges are quite open to interpretation (and also ambitious), for instance:

  • Question 1: What are disease causal changes? Can the integration of comprehensive multi-track -omics data give a clear answer?
  • Question 2: Can personalized medicine and rational drug treatment plans be derived from the data? And how can we validate them down the road?
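
As a rough illustration of what “data integration” can mean at its simplest, the sketch below joins several data layers on a shared sample ID and concatenates their features. This is only one possible reading (sometimes called early integration), not something the challenge prescribes; the function name, layer names, sample IDs and values are all made up.

```python
def integrate_layers(layers):
    """Keep only samples present in every layer and concatenate their
    features, prefixing each feature name with its layer name."""
    shared = set.intersection(*(set(table) for table in layers.values()))
    merged = {}
    for sample in sorted(shared):
        row = {}
        for layer_name, table in layers.items():
            for feature, value in table[sample].items():
                row[f"{layer_name}:{feature}"] = value
        merged[sample] = row
    return merged


# Hypothetical toy layers: {layer: {sample_id: {feature: value}}}.
layers = {
    "mrna": {
        "S1": {"gene_A": 5.1, "gene_B": 7.3},
        "S2": {"gene_A": 4.8, "gene_B": 6.9},
        "S3": {"gene_A": 5.5, "gene_B": 7.0},
    },
    "mirna": {
        "S1": {"mir_1": 2.0},
        "S2": {"mir_1": 3.5},
        "S3": {"mir_1": 1.2},
    },
    "methylation": {
        "S1": {"site_1": 0.82},
        "S2": {"site_1": 0.10},
        # S3 has no methylation measurement, so it drops out of the join.
    },
}

merged = integrate_layers(layers)
for sample, row in merged.items():
    print(sample, row)
```

The merged rows could then be fed to any model that copes with many more features than samples. Dropping samples that lack a layer is a deliberate simplification; real data sets would likely call for something gentler.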

Online analysis contests and animal testing

I’d like to draw your attention to two online data analysis challenges that both, in their way, address drug testing on animals and how results of such testing translate to human physiology.

CAMDA 2013 (12th international conference on critical assessment of massive data analysis) is a conference that focuses on massive data sets in the life sciences. This year, it has two associated analysis challenges, one of which is “prediction of drug compatibility from an extremely large toxicogenomic data set.” The data set used in this challenge contains over 20,000 genome expression microarrays, each measuring roughly 20,000 genes in the liver of rats treated with mainly human drugs. There are two questions that the organizers want to address:

  • Question 1: Can we replace animal studies with in vitro assays? [“in vitro” literally means “in glass”, for instance in a test tube]
  • Question 2: Can we predict liver injury in humans using toxicogenomics data from animals?

Meanwhile, the SBV (systems biology verification) Improver project, which ran a prediction contest last year that was covered in this blog, is starting its new Species Translation Challenge, which also aims to address how “translatable” biological events in rats or mice are to humans. This challenge, which has four sub-challenges, aims to answer the following questions:

  • Can the perturbations of signaling pathways in one species predict the response to a given stimulus in another species?
  • Which biological pathways, functions and gene expression profiles are most robustly translated?
  • Which gene expression profiles and associated biological pathways / functions are most robustly translated?
  • Does translation depend on the nature of the stimulus or data type collected such as protein phosphorylation and cytokine responses?
  • Which computational methods are most effective for inferring gene, phosphorylation and pathway response from one species to another?

I think it will be very interesting to see how these challenges play out and to compare their respective outcomes.

This & that

  • The BigML blog has been on a roll lately with many interesting posts. I particularly liked this one, Bedtime for Boosting, which goes pretty deep into benchmarking various versions of the boosting algorithms we all know and love (?).
  • Mark Gerstein of Yale University has a nice slide deck (PDF) about the big data blizzard in genomics. There are lots of ideas here about how to build predictive models based on, for example, ENCODE data. I won’t get into the ongoing controversy around ENCODE here; suffice it to say that I think the ENCODE data sets are a good resource for starting to build statistical models of genomic regulation on a larger scale.
  • The O’Reilly Radar has a good post about how Python data tools just keep getting better.
  • An “ultra-tricky” bioinformatics challenge will be run by Genome Biology on DNA Day (April 25), with a “truly awesome” prize. Intriguing.
