Follow the Data

A data driven blog

Archive for the month “June, 2012”

Prospects

I’ve been reading lately about two distinct things related to predictive analytics, both called “Prospect”.

Kaggle Prospect

The first one is the Kaggle Prospect platform. Kaggle is, as is well known by now, a company that allows companies or other organizations to run online prediction contests, where they make a data set available and data geeks all over the world compete to construct the best predictive model. The best contestants (often) get money, and the company gets a high-quality model (I don’t think a Kaggle contest has ever failed to surpass the existing benchmark if one has existed).

However, a criticism that has been levelled at Kaggle is that the most difficult part of an analytics effort is to define the data set (what features to use, how to collect data etc.) and perhaps more importantly, the questions to ask, rather than building the actual predictive model, whereas Kaggle has provided contestants with well-defined questions and data sets. For example, Isomorphismes (whose Tumblr is excellent by the way), shocked me by writing that (s)he doesn’t like Kaggle (how can you not like Kaggle?) for this reason, and presented as support for this view the now much-repeated fact that Netflix never implemented the winning algorithm of the Netflix prize.

Well, now that criticism doesn’t quite apply anymore, because Kaggle Prospect is about defining new questions to ask from data sets and proposing new prediction contests based on sample data from organizations with large data sets. In other words, it’s a new type of Kaggle contest which is more about ideation than implementation. The first Prospect contest is about defining prediction contests based on patient medical record data from Practice Fusion, America’s fastest growing Electronic Health Record (EHR) community.

Prospect: Using multiple models to understand data

This is a bit related to my recent post Practical advice for machine learning: bias, variance and what to do next, where I wrote about Andrew Ng’s tips for arriving at the next step when you get stuck in a machine learning project. I came across a pretty interesting paper (PDF here) which describes some similar practical tips which are implemented in a platform called Prospect. (I haven’t tried it, and indeed I can’t find it available for download on the web, but that’s beside the point.) Basically the authors describe procedures for using sets of prediction models to solve two problems: (1) detecting label noise (identifying mislabeled training data) and (2) designing new features.

Detecting label noise. The authors introduce a type of scatter plot that shows incorrectness vs. label entropy for each example. The assumption is that you have built a large number of different prediction models (called configurations in Prospect; these can be different kinds of models, or the same kind of model with different parameter values, or a mix of both) and used k-fold cross-validation to get a prediction from each model for each training example. You can then calculate the incorrectness for an example simply as the percentage of configurations (models) that misclassified that example. The label entropy for an example is the entropy of the distribution of labels predicted by each configuration for that example.

The figure above is from the paper and shows some interesting regions in such an incorrectness/label-entropy scatter plot. At this point I will just quote from the paper:

“[The figure above] highlights three regions in the scatter plot. The canonical region contains examples that most configurations classify correctly (i.e., low-incorrectness, low-entropy). The unsure region contains examples for which different configurations generate widely varying predicted labels (i.e., high entropy). The confused region contains examples with high incorrectness and low entropy. These are the examples for which most configurations agree on a predicted label, but the predicted label is not the same as the ground truth label. These confused examples are of the most interest for detecting label noise, as the consistent misclassification by many different models suggests the ground truth label may be incorrect.”
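To make the two axes concrete, here is a minimal sketch of how incorrectness and label entropy could be computed for a single example from the cross-validated predictions of several configurations. This is not code from the paper: the function names, the toy labels and the two cut-off values used to assign regions are my own assumptions, chosen only for illustration.

```python
import math
from collections import Counter

def incorrectness(predictions, true_label):
    """Fraction of configurations that misclassified this example."""
    wrong = sum(1 for p in predictions if p != true_label)
    return wrong / len(predictions)

def label_entropy(predictions):
    """Shannon entropy (in bits) of the labels predicted for this example."""
    counts = Counter(predictions)
    n = len(predictions)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

def region(predictions, true_label, entropy_cut=0.5, incorrect_cut=0.5):
    """Assign a region; both cut-offs are illustrative assumptions."""
    if label_entropy(predictions) > entropy_cut:
        return "unsure"
    if incorrectness(predictions, true_label) > incorrect_cut:
        return "confused"
    return "canonical"

# Toy data: predictions from six hypothetical configurations per example.
examples = {
    "a": (["cat"] * 6, "cat"),
    "b": (["dog"] * 6, "cat"),
    "c": (["cat", "dog", "bird", "cat", "dog", "bird"], "cat"),
}
for name, (preds, truth) in examples.items():
    print(name, incorrectness(preds, truth), label_entropy(preds),
          region(preds, truth))
```

Example "b" is the interesting one for label noise: every configuration agrees on "dog" while the recorded label says "cat", so it lands in the confused region.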

Generating new features. This part is about looking at how examples get classified by different configurations (models) and thereby getting a sense of which examples are difficult to classify and what new features might mitigate that difficulty. Focusing on examples that are hard to classify is reminiscent of a type of ensemble method called boosting, but that is a bit different from this. The emphasis here is on helping the human modeller understand the problem better. I quote from the paper again:

“[T]he key to discovering new discriminative features lies in understanding properties of the data that distinguish one class from another. Such understanding can be developed through deep analysis of the features or through analysis of how different examples are classified. Automated feature selection methods generally focus on analysis of the feature set, but this can be non-trivial for humans (especially in a high-dimensional feature space). On the other hand, looking at how different examples are classified is generally easier to comprehend and can provide insight into deficiencies of a feature set. Prospect provides visualizations to examine aggregate statistics regarding how different examples are classified and misclassified.”

The incorrectness/label-entropy plot is used in this process as well, but now focusing on the “unsure” area where the label entropy is high; the examples there are misclassified by many configurations but are not predominantly mistaken for any particular class. As the authors put it, the examples here “seem crucial to the process of feature discovery because their high entropy suggests that the available features cannot support reliable differentiation between classes. Focusing on the development of new features that are relevant to these examples should therefore provide new discriminative power to models.”

For more details, check out the paper (linked above.)

A Very Short History of Data Science

Reblogged from What's The Big Data?:


I’m in the process of researching the origin and evolution of data science as a discipline and a profession. Here are the milestones that I have picked up so far, tracking the evolution of the term “data science,” attempts to define it, and some related developments.  I would greatly appreciate any pointers to additional key milestones (events, publications, etc.).


The word "datalogy" (mentioned in the beginning) is still used in Sweden; I used to teach courses in it!

Meetup groups for Big Data & Predictive Modeling and Quantified Self in Stockholm

Two interesting new meetup groups have formed in Stockholm (well, there are other interesting ones but for the purposes of this blog these two are the most exciting):

Fun!

Practical advice for machine learning: bias, variance and what to do next

The online machine learning course given by Andrew Ng in 2011 (available here among many other places, including YouTube) is highly recommended in its entirety, but I just wanted to highlight a specific part of it, namely the “Practical advice” part, which touches on things that are not always included in machine learning and data mining courses, like “Deciding what to do next” (the title of this lecture) or “debugging a learning algorithm” (the title of the first slide in that talk).

His advice here focuses on the concepts of bias and variance in statistical learning. I had been vaguely aware of the concepts of “bias and variance tradeoff” and “bias/variance decomposition” for a long time, but I had always viewed those as theoretical concepts that were mostly helpful for thinking about the properties of learning algorithms; I hadn’t thought that much about connecting them to the concrete tasks of model development.

As Andrew Ng explains, bias relates to the ability of your model function to approximate the data, and so high bias is related to under-fitting. For example, a linear regression model would have high bias when trying to model a quadratic relationship – no matter how you set the parameters, you can’t get a good training set error.
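The linear-versus-quadratic example is easy to try out. Below is a small sketch in plain Python (no libraries) that fits a straight line by ordinary least squares to noise-free points on y = x². The data grid and the helper names are my own choices for illustration and are not taken from the course.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(slope, intercept, xs, ys):
    """Average squared difference between the line and the targets."""
    residuals = (slope * x + intercept - y for x, y in zip(xs, ys))
    return sum(r ** 2 for r in residuals) / len(xs)

# Noise-free quadratic data on a symmetric grid from -2 to 2.
xs = [k / 10 for k in range(-20, 21)]
ys = [x * x for x in xs]

slope, intercept = fit_line(xs, ys)
train_mse = mean_squared_error(slope, intercept, xs, ys)
print(f"slope={slope:.3f} intercept={intercept:.3f} training MSE={train_mse:.3f}")
```

Even though there is no noise at all in the data, the best possible line still leaves a substantial training error, which is what high bias looks like in practice.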

Variance on the other hand is about the stability of your model in response to new training examples. An algorithm like K-nearest neighbours (K-NN) has low bias (because it doesn’t really assume anything special about the distribution of the data points) but high variance, because it can easily change its prediction in response to the composition of the training set. K-NN can fit the training data very well if K is chosen small enough (in the extreme case with K=1 the fit will be perfect) but may not generalize well to new examples. So in short, high variance is related to over-fitting.
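The claim about K=1 can be checked with a few lines of code. The sketch below is a bare-bones one-dimensional K-NN classifier written for illustration only; the synthetic data (a threshold at 0.5 with 20% of the labels flipped) and all function names are assumptions of mine.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    neighbours = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def error_rate(train, data, k):
    """Fraction of points in `data` that are misclassified."""
    wrong = sum(1 for x, y in data if knn_predict(train, x, k) != y)
    return wrong / len(data)

def make_data(n, rng, noise=0.2):
    """Label is 1 when x > 0.5, then flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

rng = random.Random(0)
train = make_data(200, rng)
test = make_data(200, rng)

for k in (1, 15):
    print(f"K={k:2d}  training error={error_rate(train, train, k):.3f}  "
          f"test error={error_rate(train, test, k):.3f}")
```

With K=1 each training point is its own nearest neighbour, so the training error is zero even though a fifth of the labels are noise; comparing the test errors for the two K values shows how much of that perfect fit carries over to new examples.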

There is usually a tradeoff between bias and variance, and many learning algorithms have a built-in way to control this tradeoff, like for instance a regularization parameter that penalizes complex models in many linear modelling type approaches, or indeed the K value in K-NN. A lot more about the bias-variance tradeoff can be found in this Andrew Ng lecture.

Now, based on these concepts, Ng goes on to suggest some ways to modify your model when you discover it has a high error on a test set. Specifically, when should you:

- Get more training examples?

(Answer: When you have high variance. More training examples will not fix a high bias, because your underlying model will still not be able to approximate the correct function.)

- Try smaller sets of features?

(Answer: When you have high variance. Ng says, if you think you have high bias, “for goodness’ sake don’t waste your time by trying to carefully select the best features”.)

- Try to obtain new features?

(Answer: Usually works well when you suffer from high bias.)

Now you might wonder how you know that you have either high bias or high variance. This is where you can try to plot learning curves for your problem. You plot the error on the training set and on the cross-validation set as functions of the number of training examples for some set of training set sizes. (This of course requires you to randomly select examples from your training set, train models on them and assess the performance for each subset.)
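The procedure in the parenthesis can be sketched as follows. To keep it self-contained I use a deliberately high-bias model (a straight line fitted to noisy quadratic data) and a single held-out validation set rather than full cross-validation; the data generator, the list of sizes and the function names are all assumptions made for this illustration.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(model, points):
    """Mean squared error of a (slope, intercept) model on some points."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)

def learning_curve(train, validation, sizes, rng):
    """For each size m: train on a random subset of m examples and
    record (m, training error, validation error)."""
    curve = []
    for m in sizes:
        subset = rng.sample(train, m)
        model = fit_line(subset)
        curve.append((m, mse(model, subset), mse(model, validation)))
    return curve

def make_quadratic_data(n, rng):
    """Noisy samples from y = x^2 with x drawn uniformly from [-2, 2]."""
    data = []
    for _ in range(n):
        x = rng.uniform(-2, 2)
        data.append((x, x * x + rng.gauss(0, 0.1)))
    return data

rng = random.Random(1)
train = make_quadratic_data(200, rng)
validation = make_quadratic_data(200, rng)
sizes = [5, 10, 20, 50, 100, 200]

curve = learning_curve(train, validation, sizes, rng)
for m, train_err, val_err in curve:
    print(f"{m:4d}  train={train_err:.3f}  validation={val_err:.3f}")
```

Plotting the two error columns against the first column gives the learning curve. In a real project you would probably average over several random subsets per size to smooth out the curve.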

In the typical high bias case, the cross-validation error will initially go down and then plateau as the number of training examples grows. (With high bias, more data doesn’t help beyond a certain point.) The training error will initially go up and then plateau at approximately the level of the cross-validation error (usually a fairly high level of error). So if you have similar cross-validation and training errors for a range of training set sizes, you may have a high-bias model and should look into generating new features or changing the model structure in some other way.

In the typical high variance case, the training error will increase somewhat with the number of training examples, but usually to a lower level than in the high-bias case. (The classifier is now more flexible and can fit the training data more easily, but will still suffer somewhat from having to adapt to many data points.) The cross-validation error will again start high and decrease with the number of training examples to a lower but still fairly high level. So the crucial diagnostic for the high variance case, says Ng, is that the difference between the cross-validation error and the training set error is high. In this case, you may want to try to obtain more data, or if that isn’t possible, decrease the number of features.
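These two diagnostics can be turned into a crude rule of thumb in code. The sketch below is my own simplification rather than anything Ng presents: the `acceptable_error` and `gap_ratio` parameters are hypothetical knobs you would have to set for your own problem, and the suggested next steps are the ones listed earlier in this post.

```python
def diagnose(train_error, cv_error, acceptable_error, gap_ratio=0.5):
    """Rough diagnosis from the right-hand end of the learning curves.

    gap_ratio is an illustrative threshold: if the gap between the
    cross-validation error and the training error makes up more than
    this share of the cross-validation error, call it high variance.
    """
    if cv_error <= acceptable_error:
        return "ok"
    gap = cv_error - train_error
    if gap > gap_ratio * cv_error:
        return "high variance"
    return "high bias"

NEXT_STEPS = {
    "ok": [],
    "high variance": ["get more training examples",
                      "try smaller sets of features"],
    "high bias": ["try to obtain new features",
                  "change the model structure"],
}

# Hypothetical error values read off two different learning curves.
for train_error, cv_error in [(0.30, 0.32), (0.02, 0.30)]:
    verdict = diagnose(train_error, cv_error, acceptable_error=0.10)
    print(train_error, cv_error, verdict, NEXT_STEPS[verdict])
```

Real learning curves are noisier than two numbers, so it is safer to treat this as a prompt for looking at the plot than as a verdict.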

To summarize (using pictures from this PDF):

- Learning curves can tell you whether you appear to suffer from high bias or high variance.

- You can base your next step on what you found using the learning curves:

I think it’s nice to have these kinds of rules of thumb when you get stuck, and I hope to follow up this post pretty soon with another one that deals with a relatively recent paper which suggests some neat ways to investigate a classification problem using sets of classification models.
