The importance of proper cross-validation and experimental design
A study that claimed to have found a genetic signature for autism, which supposedly could be used as a screening test for autism risk at birth or in infancy, has recently been debunked in a new paper. The authors of the original paper could have avoided this embarrassment by following some well-known principles of statistics and data analysis. If you are interested in the gory details, the problems with the paper have been laid out well in Ed Yong’s article in The Scientist, in the refutation paper and its supplementary material, and in a blog post at Genomes Unzipped. Briefly:
- The authors first performed feature selection on the whole data set, and only then divided it into a training set and a validation set to assess performance. As every data miner should know, this is a bad idea: information about the validation samples leaks into (“contaminates”) the model, which typically produces wildly optimistic performance estimates and a model that generalizes poorly to genuinely new data.
- The case and control (disease-bearing vs. healthy) groups had slightly different ethnic compositions, confounding disease risk with irrelevant ancestral genetic differences.
- Another source of confounding: the case and control groups were characterized at different times and on different technological platforms, leading to so-called batch effects, where differences due to disease cannot be disentangled from technology-specific biases (which always exist to some degree).
- Always make sure you have a completely separate data set on which to test your final model after all parameter tuning and training have been done.
- Think about your experimental design beforehand so that you minimize unrelated sources of variation, possibly using some kind of randomization scheme. If the effect you are interested in varies together with some other factor, such as measurement technology or ethnic background, it is very hard or impossible to pull out the real signal.
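The first point, selection before splitting, is easy to demonstrate on pure noise. The sketch below (my own illustration, not code from the study; the nearest-centroid classifier and the top-k mean-difference filter are simple stand-ins for whatever methods the original authors used) selects the 10 "best" of 1000 random features either on the full data set (leaky) or inside each training fold (proper), then cross-validates. The labels are assigned at random, so any accuracy above 50% is an artifact.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 60, 1000, 10            # samples, features, features kept
X = rng.normal(size=(n, p))       # pure noise: there is no real signal
y = np.repeat([0, 1], n // 2)     # arbitrary "case"/"control" labels

def select_top_k(X, y, k):
    """Pick the k features whose class means differ most (a simple filter)."""
    diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    return np.argsort(diff)[-k:]

def centroid_predict(X_tr, y_tr, X_te):
    """Nearest-centroid classifier: assign each point to the closer class mean."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = ((X_te - c0) ** 2).sum(axis=1)
    d1 = ((X_te - c1) ** 2).sum(axis=1)
    return (d1 < d0).astype(int)

def cv_accuracy(X, y, k, folds=5, leaky=False):
    idx = rng.permutation(len(y))
    if leaky:                     # WRONG: selection sees the whole data set
        feats = select_top_k(X, y, k)
    accs = []
    for f in range(folds):
        te = idx[f::folds]
        tr = np.setdiff1d(idx, te)
        if not leaky:             # RIGHT: selection inside the training fold only
            feats = select_top_k(X[tr], y[tr], k)
        pred = centroid_predict(X[tr][:, feats], y[tr], X[te][:, feats])
        accs.append((pred == y[te]).mean())
    return float(np.mean(accs))

leaky_acc = cv_accuracy(X, y, k, leaky=True)    # far above 50% on pure noise
proper_acc = cv_accuracy(X, y, k, leaky=False)  # near chance, as it should be
print(f"leaky CV accuracy:  {leaky_acc:.2f}")
print(f"proper CV accuracy: {proper_acc:.2f}")
```

The leaky estimate looks impressive even though the data contain no signal at all, which is exactly the trap the original study fell into: the validation set was no longer independent of the feature selection.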
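The batch-effect and randomization points can be illustrated the same way. In this sketch (again my own toy simulation, with an assumed additive per-batch bias), a feature has no disease effect at all, only a constant shift between two measurement batches. If all controls are run in one batch and all cases in the other, the batch shift shows up as an apparent disease effect; if batch assignment is randomized with respect to disease status, it largely cancels out.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
disease = np.repeat([0, 1], n // 2)   # 0 = control, 1 = case
batch_shift = 1.0                     # assumed technology-specific bias of batch 1

def apparent_effect(batch, n_features=200):
    """Average case-vs-control difference over noise features measured with a
    per-batch additive shift. The true disease effect is zero by construction."""
    x = rng.normal(size=(n, n_features)) + batch_shift * batch[:, None]
    diffs = x[disease == 1].mean(axis=0) - x[disease == 0].mean(axis=0)
    return float(diffs.mean())

# Confounded design: all controls in batch 0, all cases in batch 1.
confounded = apparent_effect(disease.copy())

# Randomized design: batch assignment independent of disease status.
randomized = apparent_effect(rng.permutation(disease))

print(f"confounded apparent effect: {confounded:.2f}")  # close to batch_shift
print(f"randomized apparent effect: {randomized:.2f}")  # close to zero
```

In the confounded design the measured "disease effect" is essentially the batch shift itself, and no amount of downstream statistics can separate the two; randomizing samples across batches before measurement is what breaks the link.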