Follow the Data

A data driven blog

Archive for the category “Articles”

Lessons learned from mining heterogeneous cancer data sets

How much can we learn about cancer treatment and prevention by large-scale data collection and analysis?

An interesting paper was just published: “Assessing the clinical utility of cancer genomic and proteomic data across tumor types”. I am afraid the article is behind a paywall, but no worries – I will summarize the main points here! Basically, the authors have done a large-scale data-mining study of data published within The Cancer Genome Atlas (TCGA) project, a very ambitious effort to collect molecular data on different kinds of tumors. The main question they ask is how much clinical utility these molecular data can add to conventional clinical information such as tumor stage, tumor grade, age and gender.

The lessons I drew from the paper are:

  • The molecular data does not add that much predictive information beyond the clinical information. As the authors put it in the discussion, “This echoes the observation that the number of cancer prognostic molecular markers in clinical use is pitifully small, despite decades of protracted and tremendous efforts.” It is an unfortunate fact of life in cancer genomics that many molecular classifiers (usually based on gene expression patterns) have been proposed to predict tumor severity, patient survival and so on, but different groups keep coming up with different gene sets, and these tend not to be validated in independent cohorts.
  • When looking at which factors explain most of the variation in predictive performance, tumor type explains the most (37.4%), followed by the type of data used (that is, gene expression, protein expression, micro-RNA expression, DNA methylation or copy number variation), which explains 17.4%, with the interaction between tumor type and data type in third place (11.8%), suggesting that some data types are more informative for certain tumors than for others. The algorithm used is fairly unimportant (5.2%). At the risk of drawing unwarranted conclusions, it is tempting to generalize: the most important factor is the intrinsic difficulty of modeling the system, the next most important is deciding what data to collect and how to engineer features, and the choice of learning algorithm comes far behind.
  • Perhaps surprisingly, there was essentially no cross-tumor predictive power between models. (There was one exception to this.) That is, a model built for one type of tumor was typically useless when predicting the prognosis for another tumor type.
  • Individual molecular features (expression levels of individual genes, for instance) did not add predictive power beyond what was already in the clinical information, but in some cases the molecular subtype did. The molecular subtype is a “molecular footprint” derived using consensus NMF (nonnegative matrix factorization, an unsupervised method that can be used for dimensionality reduction and clustering). This footprint, which describes a general pattern, was informative, whereas the individual features making up the footprint weren’t. That seems consistent with the issue mentioned above about gene sets failing to consistently predict tumor severity: the predictive information lives at a higher level than the individual genes. (A sketch of what consensus NMF subtyping can look like follows right after this list.)
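For readers who want a feel for what consensus NMF subtyping involves, here is a minimal sketch, assuming a nonnegative samples-by-genes expression matrix. It illustrates the general recipe (repeated NMF runs, a co-clustering consensus matrix, then clustering of that matrix) rather than the exact pipeline used in the TCGA analyses.

```python
import numpy as np
from sklearn.decomposition import NMF
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_nmf_subtypes(X, k=4, n_runs=50, seed=0):
    """X: nonnegative (samples x genes) matrix; returns subtype labels and consensus matrix."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    consensus = np.zeros((n, n))
    for _ in range(n_runs):
        model = NMF(n_components=k, init="random",
                    random_state=rng.randint(1_000_000), max_iter=500)
        W = model.fit_transform(X)        # sample loadings on k factors
        labels = W.argmax(axis=1)         # hard-assign each sample to its dominant factor
        consensus += (labels[:, None] == labels[None, :])
    consensus /= n_runs                   # fraction of runs in which two samples co-cluster
    # Cluster the consensus matrix itself to get the final subtypes
    dist = squareform(1.0 - consensus, checks=False)
    subtypes = fcluster(linkage(dist, method="average"), k, criterion="maxclust")
    return subtypes, consensus

# Toy example: 100 "tumors" x 2000 "genes" of fake nonnegative data
X = np.abs(np.random.randn(100, 2000))
subtypes, C = consensus_nmf_subtypes(X, k=4)
```

The point of the consensus step is exactly the one made above: the stable, recurring pattern across many runs is what carries the signal, not any single factorization or gene.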

The authors argue that one reason for the failure of predictive modeling in cancer research has been that investigators have relied too much on p values to say something about the clinical utility of their markers, when they should instead have focused more on the effect size, or the magnitude of difference in patient outcomes.
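A toy, made-up illustration of that point (nothing to do with the paper’s actual data): with enough patients, a clinically negligible difference in outcome produces an impressively small p value, while the effect size stays tiny.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n = 50_000                                                # patients per group
marker_pos = rng.normal(loc=24.0, scale=12.0, size=n)     # months of survival
marker_neg = rng.normal(loc=23.5, scale=12.0, size=n)     # only 0.5 months' difference

t, p = ttest_ind(marker_pos, marker_neg)
cohens_d = (marker_pos.mean() - marker_neg.mean()) / np.sqrt(
    (marker_pos.var(ddof=1) + marker_neg.var(ddof=1)) / 2)

print(f"p = {p:.2e}, Cohen's d = {cohens_d:.3f}")  # p is tiny, d is only ~0.04
```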

They also make a good point about reliability and reproducibility. I quote: “The literature of tumor biomarkers is plagued by publication bias and selective and/or incomplete reporting”. To help combat these biases, the authors (many of whom are associated with Sage Bionetworks, which I have mentioned repeatedly on this blog) have made available an open model-assessment platform, which of course includes all the models from the paper itself but can also be used to assess your own favorite model.

A quotable Domingos paper

I’ve been (re-)reading Pedro Domingos’ paper, A Few Useful Things to Know About Machine Learning, and wanted to share some quotes that I like.

  • (…) much of the “folk knowledge” that is needed to successfully develop machine learning applications is not readily available in [textbooks].
  • Most textbooks are organized by representation [rather than the type of evaluation or optimization] and it’s easy to overlook the fact that the other components are equally important.
  • (…) if you hire someone to build a classifier, be sure to keep some of the data to yourself and test the classifier they give you on it.
  • Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.
  • What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (…)
  • (…) strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.
  • Even with a moderate dimension of 100 and a huge training set of a trillion examples, the latter cover only a fraction of about 10^-18 of the input space. This is what makes machine learning both necessary and hard.
  • (…) the most useful learners are those that facilitate incorporating knowledge.
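The trillion-examples quote deserves a quick sanity check; the 10^-18 figure follows if you assume the 100 dimensions are binary features.

```python
examples = 10**12              # "a trillion examples"
input_space = 2**100           # distinct binary inputs in 100 dimensions, ~1.3e30
print(examples / input_space)  # ~7.9e-19, i.e. on the order of 10^-18
```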

Another interesting recent paper by Domingos is What’s Missing in AI: The Interface Layer.

Previously, Domingos has done a lot of interesting work on, for instance, why Naïve Bayes often works well even though its assumptions are not fulfilled, and why bagging works well. Those are just the ones I remember; I’m sure there is a lot more.

Not contagious after all?

(via Decision Science News) Ouch! A new paper titled “The Spread of Evidence-Poor Medicine via Flawed Social-Network Analysis” (published here and available in manuscript format on arXiv) has come out arguing very strongly against the conclusions drawn by Christakis and Fowler in a series of papers where they put forward the idea that things like obesity and smoking can be transmitted through social networks; a kind of “social contagion.” I blogged about these ideas a while back, after both Wired and the New York Times had published articles on them. The title (harsh!) and the abstract speak for themselves:

The chronic widespread misuse of statistics is usually inadvertent, not intentional. We find cautionary examples in a series of recent papers by Christakis and Fowler that advance statistical arguments for the transmission via social networks of various personal characteristics, including obesity, smoking cessation, happiness, and loneliness. Those papers also assert that such influence extends to three degrees of separation in social networks. We shall show that these conclusions do not follow from Christakis and Fowler’s statistical analyses. In fact, their studies even provide some evidence against the existence of such transmission. The errors that we expose arose, in part, because the assumptions behind the statistical procedures used were insufficiently examined, not only by the authors, but also by the reviewers. Our examples are instructive because the practitioners are highly reputed, their results have received enormous popular attention, and the journals that published their studies are among the most respected in the world. An educational bonus emerges from the difficulty we report in getting our critique published. We discuss the relevance of this episode to understanding statistical literacy and the role of scientific review, as well as to reforming statistics education.

Cosma Shalizi has co-authored another paper (available here) which makes a similar point in a much more, let’s say, polite way. My impression is that Shalizi is both sharp and trustworthy (I’ve learned a lot about statistics from his blog), so I’m inclined to think he is on to something.

Biology-inspired algorithm design and non-obvious news discovery

There is a new Science article that seems really cool, although I haven’t had time to get past the paywall yet. The title is “A Biological Solution to a Fundamental Distributed Computing Problem” and the gist of it is pretty simple: a research group has found that an important procedure in distributed computing, “maximal independent set selection”, is solved in a simple and efficient way during the development of a fly’s nervous system. An algorithm based on the process that occurs in the fly’s immature nervous system can be directly applied to a network of sensors, for example.
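Since I haven’t gotten past the paywall, here is only a guess at the flavor of the result: a toy randomized maximal-independent-set routine in which nodes “announce themselves” at random and withdraw once a neighbor has won. Treat it as a generic randomized MIS sketch, not the authors’ actual algorithm.

```python
import random

def randomized_mis(adj, p=0.3, seed=0):
    """adj: dict mapping node -> set of neighbors (undirected graph). Returns a maximal independent set."""
    random.seed(seed)
    undecided = set(adj)
    mis = set()
    while undecided:
        # Each undecided node "fires" (announces itself) with probability p
        firing = {v for v in undecided if random.random() < p}
        for v in firing:
            # A node that fires while none of its neighbors fire joins the set
            if not (adj[v] & firing):
                mis.add(v)
        # Winners and their neighbors are settled and drop out
        settled = mis | {u for v in mis for u in adj[v]}
        undecided -= settled
    return mis

# Small example: a 5-cycle
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
print(randomized_mis(adj))  # prints a maximal independent set, e.g. {0, 2}
```

The appeal for sensor networks is that each node needs only local information and a source of randomness; no global coordinator is required.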

In other news, Bradford Cross, who started the data-driven flight-delay prediction company FlightCaster, is starting a new company called Woven. It will be about discovering news you are interested in, and the platform will explicitly consider a conundrum that I’ve often been thinking about (and possibly mentioned in some earlier blog post): do you really want to read news that is always perfectly tailored to your interests? Wouldn’t this cause you to miss a lot of the interesting information you get from, say, browsing a newspaper and “accidentally” reading about things you didn’t know about but which turn out to be kind of interesting? Bradford Cross mentions this in a recent interview and says that he started to “miss the serendipity that a newspaper provides”. So far so good, but how do you actually build this kind of quasi-random content exposure (I tend to think of it as a kind of beneficial noise) into a news discovery service? I guess we will soon see what Woven has in mind.
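I have no idea what Woven will actually do, but one simple way to engineer that kind of beneficial noise is an epsilon-greedy feed: mostly show tailored items, and with some small probability slip in a story from outside the user’s profile. A made-up sketch (all names here are hypothetical):

```python
import random

def build_feed(tailored, everything, n=10, epsilon=0.2, seed=None):
    """Mix a ranked, personalized list with random out-of-profile stories."""
    rng = random.Random(seed)
    tailored_set = set(tailored)
    outside = [s for s in everything if s not in tailored_set]
    feed, i = [], 0
    while len(feed) < n and (i < len(tailored) or outside):
        take_random = outside and rng.random() < epsilon
        if take_random or i >= len(tailored):
            feed.append(outside.pop(rng.randrange(len(outside))))  # serendipitous pick
        else:
            feed.append(tailored[i])                               # tailored pick
            i += 1
    return feed

tailored = [f"tailored-{k}" for k in range(20)]
everything = tailored + [f"other-{k}" for k in range(50)]
print(build_feed(tailored, everything, n=10, epsilon=0.2, seed=1))
```

Tuning epsilon is exactly the trade-off discussed above: zero gives a perfectly tailored but potentially boring feed, one gives a random newspaper.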

Finally, the PayPal Developer Network (!) has a pretty nice tutorial about analyzing and visualizing the recently released World Bank data using tools like Java servlets, Google Charts and MySQL. The World Bank data would easily deserve a verbose blog post of its own (and I was planning one several months ago) but that will have to wait until I’ve taken a proper look at it.

The next big idea in language, history and the arts? Data.

This New York Times article is more than a month old, but it ties in quite nicely with the “Culturomics” I mentioned in the previous post.

Funny quote: “This alliance of geeks and poets has generated exhilaration and also anxiety.”

Games and competitions as research tools

The first high-profile paper describing crowdsourced research results has just been published in Nature. (I am excluding things like folding@home from consideration here, since in those cases the crowds are donating their processor cycles rather than their brainpower.) The paper describes how the game FoldIt (which I blogged about roughly a year ago) was used to refine predicted protein structures. This is an excerpt from the abstract:

Foldit players interact with protein structures using direct manipulation tools and user-friendly versions of algorithms from the Rosetta structure prediction methodology, while they compete and collaborate to optimize the computed energy. We show that top-ranked Foldit players excel at solving challenging structure refinement problems in which substantial backbone rearrangements are necessary to achieve the burial of hydrophobic residues. Players working collaboratively develop a rich assortment of new strategies and algorithms; unlike computational approaches, they explore not only the conformational space but also the space of possible search strategies. The integration of human visual problem-solving and strategy development capabilities with traditional computational algorithms through interactive multiplayer games is a powerful new approach to solving computationally-limited scientific problems.

So in other words, FoldIt tries to capitalize on intuitive or implicit human problem-solving skills to complement brute-force computational algorithms. Interestingly, all FoldIt players are credited as co-authors of the Nature paper, so technically I could count myself as one of them, seeing that I gave the game a try last year. (It’s a lot of fun, actually.)

I think games and competitions (which are almost the same thing, really) will soon be used a lot more than they are today in scientific research (and of course in other areas like productivity, innovation and personal health management, too). The Kaggle blog had an interesting post about competitions as real-time science. In a short time, Kaggle has set up several interesting prediction contests. The Eurovision Song Contest and Football World Cup contests were, I guess, mostly for fun. The interesting thing about the latter, though, was that it was set up as a “Take on the quants” contest, where quantitative analysts from leading banks were pitted against other contestants – and they did terribly. Now the quants have a chance to redeem themselves in the INFORMS challenge, which is about their specialty area – stock price movements …

Anyway … the newest Kaggle contest is very interesting for me as a chess enthusiast. It is an attempt to improve on the age-old (well … I think it was introduced in the late 1960s) Elo rating formula, which is still used in official chess ranking lists. This system was invented by a statistician, Arpad Elo, based mostly on theoretical considerations, but it has done its job OK. The Elo ratings should ideally be able to predict results of games with reasonable accuracy (as an aside, people have also often tried to use them to compare players from different epochs, which is a futile exercise, but that’s a topic for another post), but whether they really do that has not been very thoroughly analyzed. The Elo system also has some less well understood properties, like an apparent “rating inflation” (which may or may not be an actual inflation). Some years ago, a statistician named Jeff Sonas started to develop his own system, which he claimed was able to predict the results of future games more accurately.
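For reference, the machinery the contestants are trying to beat is tiny: the Elo expected score is a logistic function of the rating difference, and ratings are nudged toward the actual result after each game. A minimal version (the 400 and the base-10 logistic are the standard constants; the K factor varies by federation and rating level):

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=10):
    """score_a: 1 for a win, 0.5 for a draw, 0 for a loss (from A's perspective)."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# A 2400-rated player beating a 2500-rated player:
print(elo_expected(2400, 2500))     # ~0.36 expected score for the lower-rated player
print(elo_update(2400, 2500, 1.0))  # the winner gains ~6.4 points with K=10
```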

Now, Sonas (with Kaggle) has taken the next step and arranged a competition to see whether crowdsourcing the problem will yield an even better system. The competitors get the results of 65,000 recent games between top players and attempt to predict the outcomes of a further 7,809 games. At the time of writing, there are already two rating systems that are doing better than Elo (see the leaderboard).

By the way, if you think chess research is not serious enough, Kaggle also has a contest about predicting HIV progression. I’m sure they have other scientific prediction contests lined up (I’ve noticed a couple of interesting – and lucrative – ones at Innocentive too.)

Sergey Brin’s new science and IBM’s Jeopardy machine

Two good articles from the mainstream press.

Sergey Brin’s Search for a Parkinson’s Cure deals with the Google co-founder’s quest to minimize his high hereditary risk of getting Parkinson’s disease (which he found out about through a test from 23andme, the company his wife founded) while simultaneously paving the way for a more rapid way of doing science.

Brin is proposing to bypass centuries of scientific epistemology in favor of a more Googley kind of science. He wants to collect data first, then hypothesize, and then find the patterns that lead to answers. And he has the money and the algorithms to do it.

This idea about a less hypothesis-driven kind of science, based more on observing correlations and patterns, surfaces once in a while. A couple of years ago, Chris Anderson received a lot of criticism for describing more or less the same idea in The End of Theory. You can’t escape the need for some sort of theory or hypothesis, and when it comes to something like Parkinson’s we just don’t know enough about its physiology and biology yet. However, I think Brin is right in emphasizing the need to get data and knowledge about diseases to circulate more quickly, and to try to milk the existing data sets for what they are worth. If nothing else, his frontal attack on Parkinson’s may lead to improved techniques for dealing with über-sized data sets.

Smarter Than You Think is about IBM’s new question-answering system Watson, which is apparently now good enough to be put in an actual Jeopardy competition on US national TV (scheduled to happen this fall). It’s a bit hard to believe, but I guess time will tell.

Most question-answering systems rely on a handful of algorithms, but Ferrucci decided this was why those systems do not work very well: no single algorithm can simulate the human ability to parse language and facts. Instead, Watson uses more than a hundred algorithms at the same time to analyze a question in different ways, generating hundreds of possible solutions. Another set of algorithms ranks these answers according to plausibility; for example, if dozens of algorithms working in different directions all arrive at the same answer, it’s more likely to be the right one.
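Just to make the quoted idea concrete, here is a toy sketch of “many candidate generators plus many scorers, rank by agreement”. All names are made up and the real Watson is of course vastly more elaborate.

```python
from collections import defaultdict

def answer(question, generators, scorers):
    # 1. Many independent strategies each propose candidate answers
    candidates = set()
    for generate in generators:
        candidates.update(generate(question))
    # 2. Every scorer grades every candidate; agreement across scorers
    #    pushes a candidate's combined score up
    combined = defaultdict(float)
    for cand in candidates:
        for score in scorers:
            combined[cand] += score(question, cand)
    # 3. Rank candidates by combined evidence
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Dummy components, just to show the shape of the pipeline
generators = [lambda q: {"answer A", "answer B"}, lambda q: {"answer B", "answer C"}]
scorers = [lambda q, c: 1.0 if "B" in c else 0.2,
           lambda q, c: 0.7 if "B" in c else 0.4]
print(answer("some clue", generators, scorers))  # "answer B" comes out on top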

IBM plans to sell Watson-like systems to corporate customers for sifting through huge document collections.

Surprising self-experimentation

Seth Roberts, a pioneer in self-experimentation, has written an extremely interesting article called “The unreasonable effectiveness of my self-experimentation”  [PDF link]. In it, he tries to explain why his self-experiments were, in his opinion, so much more successful than a lot of conventional research. As he puts it himself in the paper:

[…] I was not an expert in what I studied and my research cost almost nothing. I did it in my spare time. In spite of this, my self-experimental research was far better than my mainstream research […]

Roberts describes how he started with self-experimentation by counting his pimples every day and trying a treatment to get rid of them. Eventually, he would discover surprising facts about himself, for example that drinking sugar water would tend to make him lose weight, and that eating breakfast would tend to make him wake up too early (but that standing up a lot would make him wake up later.) One of the main reasons he gives for his success is the freedom from academic pressure:

My self-experimentation was not my job. For a long time, I did not expect to publish it; even later, after I decided to, I did not plan to use it to gain status within a profession. This freed me to (a) do whatever worked and (b) take as long as necessary. Professional scientists cannot try anything and cannot take as long as necessary. As Dyson […] said, ‘‘In almost all the varied walks of life, amateurs have more freedom to experiment and innovate [than professionals].”

The paper is interesting throughout.

Edit 2/6 2010: I found another paper by Roberts, a 61-page whopper called “Self-experimentation as a source of new ideas: Ten examples about sleep, mood, health and weight“, where he goes into a lot more detail (complete with pretty graphs plotted in R) about his various experiments. Definitely worth a look too.

Viewpoints on self-tracking

Here are some interesting articles on self-tracking published during the spring.

The data-driven life, a very meaty and well-researched article in The New York Times. It’s written by Gary Wolf, who is a co-host of the self-tracking blog, The Quantified Self. Standout quote:

With my spreadsheet, I inadvertently transformed myself into the mean-spirited, small-minded boss I imagined I was escaping through self-employment.

An interview with Nicholas Felton, who publishes a “personal annual report” crammed with visualizations of data he has collected about himself. Standout quote:

I think it would be more accurate to say that the age of the illusion of privacy is over. Your activities have long been transparent to credit card, mobile phone operators and others… now we have been given the tools to reveal this information socially (intentionally or unintentionally).

Numbers from the heart, a highly interesting essay by professor Ramesh Rao, who has done some heavy-duty signal analysis of his heart rate variability while meditating, running and sleeping, amongst other things. Standout quote:

The irony of getting attached to a practice that teaches detachment got me to take a look at Poincare plots of different styles of Yoga.

The essay also includes an interesting passage about entoptic phenomena (visual phenomena generated “internally” by the nervous system.)
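If you want to play with Poincaré plots yourself, the construction is simple: plot each RR interval (the time between consecutive heartbeats) against the next one, and summarize the cloud with the standard SD1/SD2 statistics. A minimal sketch with made-up data (nothing to do with Rao’s actual analysis):

```python
import numpy as np
import matplotlib.pyplot as plt

def poincare_plot(rr_ms):
    """rr_ms: sequence of RR intervals in milliseconds."""
    rr = np.asarray(rr_ms, dtype=float)
    x, y = rr[:-1], rr[1:]                       # RR(n) vs RR(n+1)
    sd1 = np.std((y - x) / np.sqrt(2), ddof=1)   # short-term variability
    sd2 = np.std((y + x) / np.sqrt(2), ddof=1)   # long-term variability
    plt.scatter(x, y, s=5)
    plt.xlabel("RR(n) [ms]")
    plt.ylabel("RR(n+1) [ms]")
    plt.title(f"Poincare plot: SD1 = {sd1:.1f} ms, SD2 = {sd2:.1f} ms")
    plt.show()
    return sd1, sd2

# Fake data: resting heart rate around 70 bpm with beat-to-beat jitter
rr = 850 + np.random.normal(0, 30, size=300)
poincare_plot(rr)
```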

Why I stopped tracking by Alexandra Carmichael is a powerful reminder of the potential drawbacks of self-tracking.

1.2 zettabyte of data

OK, so I was a bit slow to discover this, but The Economist has a special report on big data which is freely available online. That is, the individual articles are free, and a PDF compiling them is supposed to cost 3 GBP, but I was able to download it for free here without doing anything special.

A fun fact that I learned from this report is that the total amount of information in the world this year is projected to reach 1.2 ZB (zettabytes) – which is 1.2×10^21 bytes. How on earth did they come up with that figure…? Anyway, this report is worth a read, as it touches on things like business analytics, web mining, open government data and augmented cognition, while also giving some well-deserved love to R and open source software.
