Follow the Data

A data driven blog

Archive for the tag “prediction”

Explaining predictions and imbalanced data sets

In many applications it is of interest to know why a specific prediction was made for an individual instance. For example, a credit card company might want to be able to explain why a customer’s card was blocked even though there was no actual fraud. Obviously, this kind of “reverse engineering” of predictions is very important for improving the predictive model, but many (most?) machine learning methods don’t have a straightforward way of explaining why a specific prediction was made, as opposed to displaying a “decision surface” or similar that describes in general what matters for the classification. There are exceptions, like decision trees, where the resulting classifier shows in a transparent way (using binary tests on the features) what the prediction will be for a specific example, and naïve Bayes classifiers, where it is straightforward to quantify how much each feature value contributes to the prediction.
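
To make the naïve Bayes case concrete, here is a minimal sketch of how per-feature contributions can be read off as log-likelihood ratios. The feature names, likelihoods and prior are all invented for illustration; nothing here comes from a real fraud model.

```python
import math

# Hypothetical P(feature value | class) for one blocked transaction.
# All numbers are made up for illustration.
likelihoods = {
    "foreign_merchant": {"fraud": 0.60, "legit": 0.05},
    "night_time": {"fraud": 0.50, "legit": 0.20},
    "small_amount": {"fraud": 0.10, "legit": 0.40},
}
prior = {"fraud": 0.01, "legit": 0.99}


def log_odds_contributions(likelihoods, prior):
    """Return the prior log-odds and each feature's log-likelihood ratio."""
    base = math.log(prior["fraud"] / prior["legit"])
    contribs = {
        name: math.log(p["fraud"] / p["legit"]) for name, p in likelihoods.items()
    }
    return base, contribs


base, contribs = log_odds_contributions(likelihoods, prior)

# Under the naive independence assumption the log-odds simply add up.
total_log_odds = base + sum(contribs.values())
posterior_fraud = 1.0 / (1.0 + math.exp(-total_log_odds))

for name, value in sorted(contribs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:18s} {value:+.2f}")
print(f"posterior P(fraud) = {posterior_fraud:.3f}")
```

A positive contribution pushes the prediction towards “fraud”, a negative one towards “legit”, and the size of each term says how hard it pushes.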

Someone asked a question about this on MetaOptimize QA and the answers contained some interesting pointers to published work about this problem. Alexandre Passos recommends two papers, How To Explain Individual Classification Decisions (pdf link) by Baehrens et al., and An Efficient Explanation of Individual Classifications using Game Theory (pdf link) by Strumbelj and Kononenko. The Baehrens et al. paper defines something called an “explanation vector” for a data point, which is (if I have understood the paper correctly) a vector that has the same dimension as the data point itself (the number of features) and that points towards the direction of maximum “probability flow” away from the class in question. The entries in this vector that have large absolute values correspond to features that have a large local influence on which class is predicted. The problem is that this vector typically cannot be calculated directly from the classification model (except in some cases like Gaussian Processes), so it has to be estimated using some sort of smoothing of the data; in the paper they use Parzen windows.

I would really like to have code to do this (ideally, an R package) but couldn’t find any in the paper.
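
In the meantime, the general idea can be sketched in a few lines of Python. This is not the authors’ code. It assumes a Gaussian Parzen window with a hand-picked bandwidth `sigma`, a tiny invented data set, and a finite-difference gradient instead of the analytic derivative. The function names are hypothetical.

```python
import math


def gaussian_kernel(x, xi, sigma):
    """Unnormalized Gaussian window centred on training point xi."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def parzen_class_prob(x, X, y, label, sigma=1.0):
    """Parzen-window estimate of P(class == label | x)."""
    weights = [gaussian_kernel(x, xi, sigma) for xi in X]
    in_class = sum(w for w, yi in zip(weights, y) if yi == label)
    return in_class / sum(weights)


def explanation_vector(x, X, y, sigma=1.0, eps=1e-4):
    """Gradient of P(class != predicted class | x), by central differences."""
    labels = sorted(set(y))
    predicted = max(labels, key=lambda c: parzen_class_prob(x, X, y, c, sigma))

    def prob_other(point):
        return 1.0 - parzen_class_prob(point, X, y, predicted, sigma)

    grad = []
    for j in range(len(x)):
        up = list(x)
        down = list(x)
        up[j] += eps
        down[j] -= eps
        grad.append((prob_other(up) - prob_other(down)) / (2.0 * eps))
    return predicted, grad


# Invented toy data: the class depends on the first feature only.
X = [(-1.0, 0.0), (-1.2, 1.0), (-0.8, -1.0), (1.0, 0.0), (1.2, 1.0), (0.8, -1.0)]
y = [0, 0, 0, 1, 1, 1]
x = (-0.5, 0.2)

predicted, grad = explanation_vector(x, X, y)
print("predicted class:", predicted)
print("explanation vector:", [round(g, 3) for g in grad])
```

In this toy example the first entry of the vector is much larger in absolute value than the second, which matches the fact that only the first feature separates the two classes.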

The Strumbelj paper uses a completely different approach, which I frankly can’t really wrap my head around, but which is based on game theory, specifically the idea that an explanation of a classifier’s prediction can be treated as something called a “coalitional form game” where the instance’s feature values form a “coalition” which causes a change in the classifier’s prediction. This lets them use something called the “Shapley value” to assess the contributions of each feature.

Again, it would be really nice to have the code for this, even though the authors state that “the explanation method is a straightforward Java implementation of the equations presented in this paper.”
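
For a handful of features, Shapley values can be computed exactly by enumerating all coalitions. The sketch below is a simplification and not the method from the paper: features outside the coalition are set to a fixed baseline value instead of being averaged over their possible values, and the model `toy_model` is an invented function.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(x):
    """Hypothetical model: one additive feature and one pairwise interaction."""
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved. The enumeration is exponential in the number of features, so this only works as an illustration.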

On another “practical machine learning tips” note, a newish and very good blog linked to a very interesting article, Learning from Imbalanced Data. In working with genome-scale biological data, we often encounter imbalanced data scenarios where the positive class may contain, say, 1,000 instances, and the negative class 1,000,000 instances. I knew from experience that it is not straightforward to build and assess the performance of classifiers for this type of data set (for example, concepts like accuracy and ROC-AUC become highly problematic), but like Carl, I was surprised by the sheer variety of approaches outlined in this review. For instance, there are very nice expositions of the dangers of under- and oversampling and how to perform more informed versions of those. Also, I had realized that cost-sensitive evaluation methods could be useful (it may be much more important to classify instances in the rare class correctly, for example) but before reading this review I hadn’t thought about how to integrate cost-sensitivity into the actual learning method.
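
A quick sketch of why accuracy is misleading at this class ratio, and of one simple way to put costs into the loss that a learner minimizes. The weighting scheme (each positive example counts `pos_weight` times) is just one common choice and is an assumption here, not something taken from the review.

```python
import math

n_pos, n_neg = 1_000, 1_000_000

# A "classifier" that always predicts the negative class.
accuracy = n_neg / (n_pos + n_neg)
recall_rare_class = 0 / n_pos
print(f"accuracy = {accuracy:.4f}, recall on rare class = {recall_rare_class:.1f}")


def weighted_log_loss(y_true, p_pred, pos_weight=1.0):
    """Log loss in which every positive example counts pos_weight times."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        w = pos_weight if y == 1 else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum


# Invented mini example: one positive that the model scores as unlikely.
y_true = [1, 0, 0, 0]
p_pred = [0.1, 0.1, 0.1, 0.1]
unweighted = weighted_log_loss(y_true, p_pred)
weighted = weighted_log_loss(y_true, p_pred, pos_weight=3.0)
print(f"unweighted loss = {unweighted:.3f}, weighted loss = {weighted:.3f}")
```

The always-negative classifier looks excellent by accuracy while finding none of the rare instances. With the weighted loss, the same confident miss on a positive example is penalized more heavily, so a learner minimizing it has a reason to care about the rare class.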

Lessons from medical predictive analytics competitions

Presentations from the organizers and best performing teams of the Systems Biology Improver Challenge, which I have covered before, have been put online. There is a ton of interesting stuff here for people like me who are interested in prediction contests, machine learning, and/or medical bioinformatics.

To rewind a little bit, the SBV Improver Challenge was a set of competitions somewhat similar to Kaggle’s competitions (indeed, there is a presentation by Will Cukierski of Kaggle among these presentations) in that participants needed to build a predictive model for classifying diseased versus healthy samples for four diseases based on gene expression data (simplifying a little bit), and then predict the disease status for an unlabeled test set. The competitions differed from Kaggle competitions in that the training data sets were not fixed – the organizers just pointed to some suggested public data sets, which I think was a nice way to do it.

The point of these sub-competitions was to establish whether gene expression profiles are truly predictive of disease states in general or just in specific cases.

Anyway – I enjoyed browsing these presentations a lot. Some take-home messages:

  • Different diseases vary immensely in how much predictive signal the gene expression data contains. Psoriasis was much easier to classify correctly based on these data than the other diseases: multiple sclerosis, COPD and lung cancer. The nature of the disease was an incomparably more important variable than anything related to normalization, algorithms, pre-processing etc.
  • It was suggested that providing the whole test set at once may be a bad idea, because it may reveal information (such as cluster structure) that would not be known in a real-life scenario (for instance, if you were to go to the doctor to get a gene expression measurement from your own tissue.) It was suggested that next time, the platform would just provide one data point at a time to be predicted. Indeed, I have thought a lot about this “single-sample classification” problem in connection with gene expression data lately. As anyone who has worked with (microarray or RNA-seq) gene expression data knows, there are severe experimental batch effects in these data that are usually removed by normalizing all data points together, but this cannot always be done in practice.
  • In these competitions, none of the alleged superiority of random forest classifiers or GLMs was in evidence, with “simple methods” such as linear discriminant analysis and Mann-Whitney tests for feature selection performing the best. Yes, I know there is probably no statistical significance here due to the small sample size …
  • Speaking of random forests and GLMs, I was surprised to learn about the RGLM (Random Generalized Linear Model), which is kind of a mix of the two; generalized linear models are built on random subsets of the training examples and the features (like in random forest classifiers) and predictions of many of these models are aggregated to get a final prediction. The presentation is here.
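
The RGLM idea as described in the last bullet can be sketched roughly as follows. This is only an illustration of “GLMs on random subsets of examples and features, then averaged”, not the published RGLM procedure, which may differ in its details. The data, function names and all settings are invented.

```python
import math
import random


def fit_logistic(X, y, n_iter=200, lr=0.5):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def fit_random_glm_ensemble(X, y, n_models=25, n_sub_features=2, seed=0):
    """Fit one logistic model per bootstrap sample and random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    ensemble = []
    for _ in range(n_models):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        cols = rng.sample(range(p), n_sub_features)   # random feature subset
        X_bag = [[X[i][j] for j in cols] for i in rows]
        y_bag = [y[i] for i in rows]
        ensemble.append((cols, fit_logistic(X_bag, y_bag)))
    return ensemble


def predict_proba(ensemble, x):
    """Average the predicted probabilities of all models in the ensemble."""
    probs = []
    for cols, (w, b) in ensemble:
        z = b + sum(wj * x[j] for wj, j in zip(w, cols))
        probs.append(1.0 / (1.0 + math.exp(-z)))
    return sum(probs) / len(probs)


# Invented data: the label depends on features 0 and 1; feature 2 is noise.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(60)]
y = [1 if row[0] + 0.5 * row[1] > 0 else 0 for row in X]

ensemble = fit_random_glm_ensemble(X, y)
p_pos = predict_proba(ensemble, [2.0, 1.0, 0.0])
p_neg = predict_proba(ensemble, [-2.0, -1.0, 0.0])
print(round(p_pos, 3), round(p_neg, 3))
```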

The “lessons learned” presentation discusses (most of) these points and more, and is interesting throughout.



Summer reading

Some nice reading for the summer (in case of a rainy day of course):

  • Prediction, Learning and Games (PDF link) – Nice textbook on prediction. Via @ML_hipster (worth following on Twitter if you like @bigdatahipster and/or authentic, hand-crafted decision trees)
  • Data Science 101, a very nice blog which points to a multitude of resources
  • School of Data and the accompanying Data Wrangling Handbook
  • Agile Data by Russell Jurney (who is well worth following on Twitter and especially Quora). This book isn’t finished yet but can be viewed in its current state of development at the given link, which is within the Open Feedback Publishing System at O’Reilly Media. So you can, on one hand, read the book (or parts of it) for free before publication, and on the other hand, provide feedback and thus shape the contents of the book.
  • (edit 17/7 2012) Might as well throw this one in: Data Jujitsu: The Art of Turning Data into Product by DJ Patil, a free O’Reilly Radar report (epub/PDF/mobile).

New analysis competitions

Some interesting competitions in data analysis / prediction:

Kaggle is managing this year’s KDD Cup, which will be about Weibo, China’s rough equivalent to Twitter (with more support for adding pictures and comments on posts, it’s more like a hybrid between Twitter and Facebook maybe). There will be two tasks, (1) predicting which users a certain user will follow (all data being anonymized, of course), and (2) predicting click-through rate in online computational ad systems. According to Gordon Sun, chief scientist at Tencent (the company behind Weibo), the data set to be used is the largest one ever to have been released for competitive purposes.

CrowdAnalytix, an India-based company with a business idea similar to Kaggle’s, has started a fun quickie competition about sentiment mining. Actually the competition might already be over as it ran for just 9 days starting 16/2. The input consists of comments left by visitors to a major airport in India, and the goal is to identify and compile actionable and/or interesting information, such as what kind of services visitors think are missing.

The Clarity challenge is, for me, easily the most interesting challenge of the three, in that it concerns the use of genomic information in healthcare. This challenge (with a prize sum of $25,000) is, in effect, crowdsourcing genomic/medical research (although only 20 teams will get selected to participate). The goal is to identify and report on potential genetic features underlying medical disorders in three children, given the genome sequences of the children and their parents. These genetic features are presently unknown, which is why this competition really represents something new in medical research. I think this is a very nice initiative; in fact, I had thought of initiating something similar at the institute where I work, but this challenge is much better than what I had in mind. It will be very interesting to see what comes out of it.

Google Prediction API open to all

I’ve been eagerly waiting to use the Google Prediction API ever since it was announced, and now (since sometime in May) it’s open for everyone who has a Google account (and a credit card). Previously, you had to be able to provide a U.S. mailing address.

Google’s Prediction API is basically a nice way to run your classification and/or prediction tasks through Google’s black-box set of machine learning tools. The way it works is that you upload your training data to Google Storage, which is something like Google’s version of Amazon’s S3: a cloud-based storage system where you store your data in “buckets”. (Google Storage, like S3, uses the term bucket and, also like S3, requires that bucket names only use lower-case letters.) You can activate both Google Storage and the Prediction API from the Google APIs Console. This is also where you will find (click “API access” on the left hand menu) the access key that you will need to run prediction tasks. You’ll have to give credit card details to pay for potential future usage.

The training examples that you put in Storage need to be formatted according to the specification in the Developer’s Guide. Once they have been uploaded, you can train a model on the uploaded data, make predictions about new examples, update existing models and more using one of the client libraries or even simpler, just by copying some of the bash scripts shown on the same page (hidden behind ‘+’ signs which can be expanded.) For these bash scripts to work as written on that page, you need to paste your API key into a file called ‘googlekey’ located in the directory from where you are running the script.

I used this walkthrough example about cancer classification from gene expression data to get up to speed on how Google Prediction API works. Now I’m thinking about what data to throw at it next. Perhaps it would be fun to input some Kaggle contest data sets into it as a kind of “Google baseline” predictor? 🙂

Web search based prediction works well for a first-pass analysis

A simple but interesting study, Predicting consumer behavior with Web search, was just published in PNAS. Inspired by Google Flu Trends and other ways of “predicting the present” by tracking web searches in (almost) real time, the article authors try to compare these methods to “baseline” predictors that use other available sources of information. The results indicate that the search-based methods aren’t necessarily better than baseline – sometimes they are clearly worse – but the less prior information there is, the better the search based method does compared to the baseline predictor. For example, the revenues of video game sequels are well predicted by a baseline model looking at, among other things, the revenue of the predecessor, but the revenues of non-sequel games are hard to predict by the baseline predictor, whereas the search-based prediction works well. Combining both the baseline and the search-based predictors typically results in a modest increase in accuracy above the best of the two. This suggests that search-based prediction is pretty robust in the sense that it can be used in the absence of relevant information and still give reasonable results. Therefore it may be useful in various kinds of first-pass analysis before building a more accurate predictor based on many different information sources. Another advantage of search-based methods that the authors don’t really go into that much is the detection of turning points – like when a trend starts to take off. For example, in their flu prediction examples, an auto-regressive model (a model that uses a weighted average of the last few time points) tracks the actual flu outbreaks almost as well as the search-based model. However, looking closely at the plots, the auto-regressive model always lags a bit behind the search-based predictor. 
This makes sense, as the auto-regressive model is always basing its prediction on the previous time points, and so takes some time to catch on to the fact that the situation has changed radically.
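
The lag is easy to reproduce with a toy example. The sketch below uses invented weekly case counts and fixed, hand-picked weights rather than fitted ones, so it illustrates the mechanism and is not a reconstruction of the model in the paper.

```python
def ar_forecast(series, weights):
    """One-step-ahead forecasts: weighted average of the previous len(weights) points."""
    k = len(weights)
    forecasts = []
    for t in range(k, len(series)):
        window = series[t - k:t]
        forecasts.append(sum(w * v for w, v in zip(weights, window)))
    return forecasts


# Invented weekly case counts: flat, then a sudden outbreak, then a plateau.
cases = [10, 10, 10, 10, 20, 40, 80, 160, 160, 160]
weights = [0.2, 0.3, 0.5]  # oldest to newest; hypothetical, not fitted

forecasts = ar_forecast(cases, weights)
for t, forecast in zip(range(len(weights), len(cases)), forecasts):
    print(f"week {t}: actual = {cases[t]:3d}, forecast = {forecast:6.1f}")
```

From the week the outbreak starts, every forecast falls short of the actual count, because the forecast can only be a blend of values that were observed before the change.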

Edit 30/9 2010

I just realized this is the first major publication where I’ve seen stated in the Methods section that the authors used Hadoop and Pig to analyze their data. Yet, Benjamin Black tweets that Hadoop is already legacy. Things move fast.

Links without a common theme

  • Are we ready for a true data disaster? Interesting Infoworld article that talks about possibilities for devastating “data spills” that could have effects as bad as the oil spill, or worse.
  • Monkey Analytics – a “web based computation tool” that lets users run R, Python and Matlab commands in the cloud.
  • Blogs and tweets could predict the future. New Scientist article that mentions Google’s study from last year where they tried to use search data to predict various economic variables. A lot of organizations have seized upon that idea, and lately we have seen examples such as Recorded Future, a company that attempts to “mine the future” using future-related online text sources. Google famously used the “predictions from search data” idea to predict flu outbreaks. One of the interesting things here, I think, is that people’s searches (which could be viewed naïvely as ways to obtain data) actually become data in themselves; data that can be used as predictors in statistical models. The Physics of Data is an interesting video where Google’s Marissa Mayer talks about this topic and a lot of other googly stuff (I don’t really get the name of the presentation though, despite her attempt to justify it in the beginning …).
  • Wikiposit aims to be a “Wikipedia of numerical data.” It aggregates thousands of public data sets (currently 110,000) into a single format and offers a simple API to access them. As of now, it only supports time series data, mostly from the financial domain.

Machine learning competitions and algorithm comparisons

Tomorrow, 29 May 2010, a lot of (European) people will be watching the Eurovision Song Contest to see which country will take home the prize. Personally, I don’t really care about who wins the contest itself, but I do care (somewhat) about which predictor will win the Eurovision Voting Forecast competition arranged by Kaggle. Kaggle describes itself as “a platform for data mining, bioinformatics and forecasting competitions”. It provides an objective framework for comparing techniques and “allows organizations to have their data scrutinized by the world’s best statisticians.”

Contests like this are fun, but they can also have more serious aims. For instance, Kaggle also hosts a competition about predicting HIV progression based on the virus’ DNA sequence. The currently leading submission has already improved on the best methods reported in literature, and so a post at Kaggle’s No Free Hunch blog asks whether competitions might be the future of research. I think they may well be, at least in some domains. A few months back, I mentioned an interesting challenge at Innocentive which is essentially a very difficult pure research problem, and it will be interesting to learn how the winning team there did it (if any details are disclosed). (I signed up for this competition myself, but haven’t been able to devote more than one or two hours to it so far, unfortunately.)

There are other platforms for prediction competitions as well, for instance TunedIT’s challenge platform, which allows university teachers to “make their courses more attractive and valuable, through organization of on-line student competitions instead of traditional assignments.” TunedIT also has a research platform where you can run automated tests on machine learning algorithms and get reproducible experimental results. You can also benchmark results against a knowledge base or contribute to and use a repository of various data sets and algorithms.

Another initiative for serious evaluation of machine learning algorithms in various problem domains is MLcomp. Here, you can upload your own datasets (or use pre-loaded ones) and run existing algorithms on them through a web interface. MLcomp then reports various metrics that allow you to compare different methods.

By the way, 22 teams participated in Kaggle’s Eurovision challenge, and Azerbaijan is the clear favorite, having been picked as the winner by 14 teams. Let’s see how it goes tomorrow.

Data mine your way to $100,000

The crowdsourcing company Innocentive, which serves up tough (usually scientific or technical) problems for anyone to solve for a monetary reward, has put up a predictive analytics challenge for which the reward is a whopping USD 100,000. You’ll have to register to find out what the challenge is about and in order to download the data, but I don’t think I’m giving away too much by saying that it’s a life science/bioinformatics challenge, albeit one which could be solved without much knowledge of biology. Innocentive has apparently implemented a new system for testing your predictive models against a reference data set, kind of like Netflix, and they also have a leader board showing the currently best models (measured by R^2 of predictions vs. test set using Spearman’s rank correlation). Of course, the final submission from each contestant will be scored on a completely separate test set.
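
The leader board metric is only described in passing above. One plausible reading, and it is an assumption, is the square of Spearman’s rank correlation between predicted and observed values. A minimal sketch of that reading, with invented numbers and hypothetical function names:

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman_r2(predicted, observed):
    """Squared Spearman correlation: Pearson correlation of the ranks, squared."""
    rho = pearson(average_ranks(predicted), average_ranks(observed))
    return rho ** 2


# Invented predictions and test-set values.
predicted = [1.0, 2.0, 3.0, 4.0, 5.0]
observed = [2.0, 1.0, 4.0, 3.0, 5.0]
print(spearman_r2(predicted, observed))
```

Because only ranks enter the calculation, a model is rewarded for getting the ordering of the test cases right, not for matching their exact values.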

TR personalized medicine briefing

MIT’s Technology Review magazine has a briefing on personalized medicine. It’s worth a look, although it’s quite heavily tilted towards DNA sequencing technology (which I am interested in, but there is a lot more to personalized medicine). Not surprisingly, one of the articles in the briefing makes the point that the biggest bottleneck in personalized medicine will be data analysis, the risk being that “…we will end up with a collection of data … unable to predict anything.” (As an aside, I would be moderately wealthy if I had a euro for each time I’d read the phrase “drowning in data”, which appears in the article heading. I think I even rejected that as a name for this blog. It would be nice to see someone come up with a fresh alternative verb to “drowning” …)

Technology Review also has a piece on how IBM has started to put their mathematicians to work in business analytics. They mention a neat technique I hadn’t been aware of: “…they used a technique called high-quantile modeling–which tries to predict, say, the 90th percentile of a distribution rather than the mean–to estimate potential spending by each customer and calculate how much of that demand IBM could fulfill”.
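
The article gives no details of how this is done. A common way to target a quantile instead of the mean is to minimize the so-called pinball (quantile) loss. The sketch below assumes that approach, uses invented spending figures, and only fits a constant, so it shows the principle and not IBM’s model.

```python
def pinball_loss(y_true, prediction, q):
    """Average quantile (pinball) loss of a constant prediction at quantile q."""
    total = 0.0
    for y in y_true:
        diff = y - prediction
        # Under-predictions cost q per unit, over-predictions cost (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# Invented, right-skewed spending figures for twelve similar customers.
spend = [1, 2, 2, 3, 3, 4, 5, 6, 8, 12, 20, 50]

candidates = sorted(set(spend))
best_q90 = min(candidates, key=lambda c: pinball_loss(spend, c, 0.9))
mean_spend = sum(spend) / len(spend)

print(f"mean spending          = {mean_spend:.2f}")
print(f"90th percentile target = {best_q90}")
```

With q = 0.9, predicting too low is nine times as costly as predicting too high, so the minimizer lands high in the distribution. For skewed data like this, that is well above the mean, which is the point of using it to estimate potential rather than typical spending.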

The last part of the article talks about a very interesting problem: how to model a system where output from the model itself affects the system, or as the article puts it “…situations where a model must incorporate behavioral changes that the model itself has inspired”. I’m surprised the article doesn’t mention the obvious applicability of this to the stock market, where of course thousands of professional and amateur data miners use prediction models (their own and others’) to determine how they buy and sell stocks. Instead, its example comes from traffic control:

For example, […] a traffic congestion system might use messages sent to GPS units to direct drivers away from the site of a highway accident. But the model would also have to calculate how many people would take its advice, lest it end up creating a new traffic jam on an alternate route.
