I’ve been reading lately about two distinct predictive-analytics projects, both called “Prospect”.
The first one is the Kaggle Prospect platform. Kaggle is, as is well known by now, a company that lets companies and other organizations run online prediction contests: they make a data set available, and data geeks all over the world compete to construct the best predictive model. The best contestants (often) get money, and the company gets a high-quality model (I don’t think a Kaggle contest has ever failed to surpass the existing benchmark where one has existed).
However, a criticism that has been levelled at Kaggle is that the most difficult part of an analytics effort is defining the data set (what features to use, how to collect data, etc.) and, perhaps more importantly, the questions to ask, rather than building the actual predictive model, whereas Kaggle provides its contestants with well-defined questions and data sets. For example, Isomorphismes (whose Tumblr is excellent, by the way) shocked me by writing that (s)he doesn’t like Kaggle (how can you not like Kaggle?) for this reason, citing as support the now much-repeated fact that Netflix never implemented the winning algorithm of the Netflix Prize.
Well, that criticism doesn’t quite apply anymore, because Kaggle Prospect is about defining new questions to ask of data sets and proposing new prediction contests based on sample data from organizations with large data sets. In other words, it’s a new type of Kaggle contest that is more about ideation than implementation. The first Prospect contest is about defining prediction contests based on patient medical record data from Practice Fusion, America’s fastest-growing Electronic Health Record (EHR) community.
Prospect: Using multiple models to understand data
This is a bit related to my recent post Practical advice for machine learning: bias, variance and what to do next, where I wrote about Andrew Ng’s tips for arriving at the next step when you get stuck in a machine learning project. I came across a pretty interesting paper (PDF here) which describes some similar practical tips which are implemented in a platform called Prospect. (I haven’t tried it, and indeed I can’t find it available for download on the web, but that’s beside the point.) Basically the authors describe procedures for using sets of prediction models to solve two problems: (1) detecting label noise (identifying mislabeled training data) and (2) designing new features.
Detecting label noise. The authors introduce a type of scatter plot that shows incorrectness vs. label entropy for each example. The assumption is that you have built a large number of different prediction models (called configurations in Prospect; these can be different kinds of models, or the same kind of model with different parameter values, or a mix of both) and used k-fold cross-validation to get a prediction from each model for each training example. You can then calculate the incorrectness for an example simply as the percentage of configurations (models) that misclassified that example. The label entropy for an example is the entropy of the distribution of labels predicted by each configuration for that example.
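To make the two quantities concrete, here is a minimal sketch of how they could be computed, assuming the cross-validated predictions from all configurations have been collected into an (n_configs, n_examples) matrix. The function name and layout are mine, not the paper’s.

```python
import numpy as np

def incorrectness_and_entropy(pred, y_true):
    """Per-example incorrectness and label entropy, in the spirit of Prospect.

    pred   : (n_configs, n_examples) matrix of predicted class labels,
             one row per configuration (from k-fold cross-validation).
    y_true : (n_examples,) ground-truth labels.
    """
    pred = np.asarray(pred)
    y_true = np.asarray(y_true)
    n_configs, n_examples = pred.shape

    # Incorrectness: fraction of configurations that misclassified the example.
    incorrectness = (pred != y_true).mean(axis=0)

    # Label entropy: entropy of the distribution of predicted labels
    # across configurations, for each example.
    labels = np.unique(pred)
    entropy = np.zeros(n_examples)
    for j in range(n_examples):
        counts = np.array([(pred[:, j] == c).sum() for c in labels])
        p = counts[counts > 0] / n_configs
        entropy[j] = -(p * np.log2(p)).sum()

    return incorrectness, entropy
```

An example on which every configuration agrees gets entropy 0 regardless of whether the shared prediction is right or wrong; incorrectness is what separates those two cases.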
The figure above is from the paper and shows some interesting regions in such an incorrectness/label-entropy scatter plot. At this point I will just quote from the paper:
“[The figure above] highlights three regions in the scatter plot. The canonical region contains examples that most configurations classify correctly (i.e., low-incorrectness, low-entropy). The unsure region contains examples for which different configurations generate widely varying predicted labels (i.e., high entropy). The confused region contains examples with high incorrectness and low entropy. These are the examples for which most configurations agree on a predicted label, but the predicted label is not the same as the ground truth label. These confused examples are of the most interest for detecting label noise, as the consistent misclassification by many different models suggests the ground truth label may be incorrect.”
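Given per-example incorrectness and label-entropy scores, the confused region can be flagged programmatically. The thresholds below are illustrative choices of mine; in Prospect the regions are explored interactively in the scatter plot rather than fixed by hard cutoffs.

```python
import numpy as np

def confused_examples(incorrectness, entropy, inc_min=0.8, ent_max=0.5):
    """Indices of examples in the 'confused' region: high incorrectness,
    low label entropy. Most configurations agree on a label, but that
    label disagrees with the ground truth, so the ground-truth label
    itself may be noisy. Thresholds are illustrative, not from the paper.
    """
    incorrectness = np.asarray(incorrectness)
    entropy = np.asarray(entropy)
    return np.flatnonzero((incorrectness >= inc_min) & (entropy <= ent_max))
```

For instance, an example with incorrectness 0.9 and entropy 0.1 would be flagged as a label-noise candidate, while one with incorrectness 0.5 and entropy 1.2 would not, since the models do not agree on any wrong label there.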
Generating new features. This part is about looking at how examples get classified by different configurations (models), thereby getting a sense of which examples are difficult to classify and what new features might mitigate that difficulty. Focusing on hard-to-classify examples is reminiscent of boosting, a type of ensemble method, but the goal here is different: the emphasis is on helping the human modeller understand the problem better. I quote from the paper again:
“[T]he key to discovering new discriminative features lies in understanding properties of the data that distinguish one class from another. Such understanding can be developed through deep analysis of the features or through analysis of how different examples are classified. Automated feature selection methods generally focus on analysis of the feature set, but this can be non-trivial for humans (especially in a high-dimensional feature space). On the other hand, looking at how different examples are classified is generally easier to comprehend and can provide insight into deficiencies of a feature set. Prospect provides visualizations to examine aggregate statistics regarding how different examples are classified and misclassified.”
The incorrectness/label-entropy plot is used in this process as well, but now the focus is on the “unsure” area where the label entropy is high; the examples there are misclassified by many configurations but are not predominantly mistaken for any particular class. As the authors put it, these examples “seem crucial to the process of feature discovery because their high entropy suggests that the available features cannot support reliable differentiation between classes. Focusing on the development of new features that are relevant to these examples should therefore provide new discriminative power to models.”
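The unsure region can be surfaced the same way, by filtering on entropy alone. The 1.0-bit threshold below is my own illustrative choice, not something prescribed in the paper.

```python
import numpy as np

def unsure_examples(entropy, ent_min=1.0):
    """Indices of high-label-entropy ('unsure') examples, most ambiguous
    first. Configurations disagree widely on these, suggesting the
    current features cannot reliably separate the classes; they are
    the candidates to inspect when designing new features.
    The 1.0-bit threshold is an illustrative assumption.
    """
    entropy = np.asarray(entropy)
    idx = np.flatnonzero(entropy >= ent_min)
    return idx[np.argsort(-entropy[idx])]
```

Sorting by entropy puts the most contested examples first, which is a sensible order in which to eyeball them when brainstorming new features.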
For more details, check out the paper (linked above).