Practical advice for machine learning: bias, variance and what to do next
The online machine learning course given by Andrew Ng in 2011 (available here among many other places, including YouTube) is highly recommended in its entirety, but I just wanted to highlight a specific part of it, namely the “Practical advice” part, which touches on things that are not always included in machine learning and data mining courses, like “Deciding what to do next” (the title of this lecture) or “debugging a learning algorithm” (the title of the first slide in that talk).
His advice here focuses on the concepts of bias and variance in statistical learning. I had been vaguely aware of the “bias-variance tradeoff” and the “bias/variance decomposition” for a long time, but I had always viewed those as theoretical concepts that were mostly helpful for thinking about the properties of learning algorithms; I hadn’t thought that much about connecting them to the concrete tasks of model development.
As Andrew Ng explains, bias relates to the ability of your model function to approximate the data, and so high bias is related to under-fitting. For example, a linear regression model would have high bias when trying to model a quadratic relationship – no matter how you set the parameters, you can’t get a good training set error.
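To make the linear-versus-quadratic example concrete, here is a minimal sketch in plain Python. The data (a noise-free parabola on a symmetric grid) and the helper names `fit_line` and `mean_squared_error` are my own assumptions for illustration, not something from the lecture. Because the data contain no noise at all, whatever training error remains after the least-squares fit is due purely to the model being too rigid.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mean_squared_error(xs, ys, a, b):
    """Average squared residual of the line a*x + b on the given points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Noise-free quadratic data on a symmetric grid from -3.0 to 3.0 (illustrative).
xs = [i / 10 for i in range(-30, 31)]
ys = [x ** 2 for x in xs]

slope, intercept = fit_line(xs, ys)
train_error = mean_squared_error(xs, ys, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} training MSE={train_error:.3f}")
```

The best line here is essentially flat, and its training error stays well above zero even though it is the optimal choice of parameters, which is what under-fitting looks like on the training set.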
Variance on the other hand is about the stability of your model in response to new training examples. An algorithm like K-nearest neighbours (K-NN) has low bias (because it doesn’t really assume anything special about the distribution of the data points) but high variance, because it can easily change its prediction in response to the composition of the training set. K-NN can fit the training data very well if K is chosen small enough (in the extreme case with K=1 the fit will be perfect) but may not generalize well to new examples. So in short, high variance is related to over-fitting.
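The K=1 case is easy to check with a small sketch. This is a toy one-dimensional K-NN regressor written from scratch; the noisy sine data, the noise level and the function names are assumptions of mine chosen only for illustration.

```python
import math
import random


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def knn_mse(train_x, train_y, eval_x, eval_y, k):
    """Mean squared error of K-NN predictions on the evaluation points."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2
              for x, y in zip(eval_x, eval_y)]
    return sum(errors) / len(errors)


rng = random.Random(0)
# Training inputs lie on a grid, so all x values are distinct.
train_x = [i * 0.1 for i in range(60)]
train_y = [math.sin(x) + rng.gauss(0, 0.3) for x in train_x]
# Fresh points from the same process, offset so none coincides with a training x.
test_x = [i * 0.1 + 0.05 for i in range(60)]
test_y = [math.sin(x) + rng.gauss(0, 0.3) for x in test_x]

for k in (1, 5, 15):
    train_err = knn_mse(train_x, train_y, train_x, train_y, k)
    test_err = knn_mse(train_x, train_y, test_x, test_y, k)
    print(f"K={k:2d}  training MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With K=1 every training point is its own nearest neighbour, so the training error is exactly zero; the interesting column is the test error, which shows how much of that perfect fit was just memorized noise.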
There is usually a tradeoff between bias and variance, and many learning algorithms have a built-in way to control this tradeoff, such as the regularization parameter that penalizes complex models in many linear modelling approaches, or indeed the K value in K-NN. A lot more about the bias-variance tradeoff can be found in this Andrew Ng lecture.
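As a rough illustration of such a control knob, here is a sketch of ridge regression reduced to a single slope with no intercept. The data, the penalty values and the names `ridge_slope` and `training_mse` are hypothetical choices of mine; the point is only that a larger penalty pulls the coefficient towards zero and makes the model more rigid.

```python
import random


def ridge_slope(xs, ys, lam):
    """Slope a that minimizes sum((a*x - y)^2) + lam * a^2 (no intercept)."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)


def training_mse(xs, ys, a):
    """Mean squared error of the no-intercept line a*x on the training points."""
    return sum((a * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(1)
xs = [i / 10 for i in range(-10, 11)]            # inputs centered on zero
ys = [2.0 * x + rng.gauss(0, 0.3) for x in xs]   # hypothetical noisy linear data

lambdas = [0.0, 1.0, 10.0, 100.0]
slopes = [ridge_slope(xs, ys, lam) for lam in lambdas]
errors = [training_mse(xs, ys, a) for a in slopes]

for lam, a, err in zip(lambdas, slopes, errors):
    print(f"lambda={lam:6.1f}  slope={a:.3f}  training MSE={err:.3f}")
```

A larger lambda trades a worse training fit (more bias) for a coefficient that reacts less to the particular training sample (less variance), which is the same dial that K turns in K-NN.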
Now, based on these concepts, Ng goes on to suggest some ways to modify your model when you discover it has a high error on a test set. Specifically, when should you:
- Get more training examples?
(Answer: When you have high variance. More training examples will not fix high bias, because your underlying model will still not be able to approximate the correct function.)
- Try smaller sets of features?
(Answer: When you have high variance. Ng says, if you think you have high bias, “for goodness’ sake don’t waste your time by trying to carefully select the best features”.)
- Try to obtain new features?
(Answer: Usually works well when you suffer from high bias.)
Now you might wonder how you know that you have either high bias or high variance. This is where you can try to plot learning curves for your problem. You plot the error on the training set and on the cross-validation set as functions of the number of training examples for some set of training set sizes. (This of course requires you to randomly select examples from your training set, train models on them and assess the performance for each subset.)
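The procedure in the parenthesis can be sketched in a few lines of plain Python. Everything specific here is an assumption for illustration: the straight-line model, the noisy quadratic data, the training set sizes and the function names. Any model with a fit step and a prediction function could be plugged in instead.

```python
import random


def fit_line(xs, ys):
    """Closed-form least squares; returns a prediction function x -> a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def mse(model, points):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


def learning_curve(fit, train, cv, sizes, seed=0):
    """Return (m, training error, cross-validation error) for each size m."""
    rng = random.Random(seed)
    rows = []
    for m in sizes:
        subset = rng.sample(train, m)        # random subset of the training set
        xs, ys = zip(*subset)
        model = fit(xs, ys)
        # Training error is measured on the subset the model was fitted to.
        rows.append((m, mse(model, subset), mse(model, cv)))
    return rows


def make_points(n, rng):
    """Hypothetical data: a noisy parabola sampled uniformly on [-3, 3]."""
    points = []
    for _ in range(n):
        x = rng.uniform(-3, 3)
        points.append((x, x ** 2 + rng.gauss(0, 0.5)))
    return points


data_rng = random.Random(42)
train = make_points(200, data_rng)
cv = make_points(100, data_rng)

sizes = [5, 10, 20, 50, 100, 200]
curve = learning_curve(fit_line, train, cv, sizes)
for m, train_err, cv_err in curve:
    print(f"m={m:3d}  training error={train_err:.2f}  cross-validation error={cv_err:.2f}")
```

Plotting the second and third columns against the first gives the learning curves. In practice you would probably average over several random subsets per size to smooth out the curves; a single draw per size keeps the sketch short.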
In the typical high bias case, the cross-validation error will initially go down and then plateau as the number of training examples grows. (With high bias, more data doesn’t help beyond a certain point.) The training error will initially go up and then plateau at approximately the level of the cross-validation error (usually a fairly high level of error). So if you have similar cross-validation and training errors for a range of training set sizes, you may have a high-bias model and should look into generating new features or changing the model structure in some other way.
In the typical high variance case, the training error will increase somewhat with the number of training examples, but usually to a lower level than in the high-bias case. (The classifier is now more flexible and can fit the training data more easily, but will still suffer somewhat from having to adapt to many data points.) The cross-validation error will again start high and decrease with the number of training examples to a lower but still fairly high level. So the crucial diagnostic for the high variance case, says Ng, is that the gap between the cross-validation error and the training set error is large. In this case, you may want to try to obtain more data, or if that isn’t possible, decrease the number of features.
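These two diagnostics can be turned into a crude rule of thumb in code. This is only a sketch: the function `diagnose`, the `target_error` you would consider acceptable and the `gap_factor` threshold are all hypothetical knobs of my own, not numbers from Ng’s lecture, and real learning curves deserve a look by eye rather than a hard threshold.

```python
def diagnose(train_error, cv_error, target_error, gap_factor=2.0):
    """Guess the problem from the errors at the largest training set size.

    target_error: error level you would be happy with (assumption).
    gap_factor:   how many times larger than the training error the
                  cross-validation error must be to count as a large gap
                  (assumption).
    """
    if cv_error > target_error and cv_error > gap_factor * train_error:
        # Large gap between cross-validation and training error.
        return "high variance", ["get more training examples",
                                 "try a smaller set of features"]
    if train_error > target_error:
        # Both errors are high and close to each other.
        return "high bias", ["try to obtain new features",
                             "change the model structure"]
    return "no obvious problem", []


# Illustrative numbers only.
print(diagnose(train_error=7.3, cv_error=7.5, target_error=1.0))
print(diagnose(train_error=0.1, cv_error=2.4, target_error=1.0))
print(diagnose(train_error=0.4, cv_error=0.5, target_error=1.0))
```

The suggested actions are the ones from the list earlier in the post: more data or fewer features for high variance, new features or a different model structure for high bias.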
To summarize (using pictures from this PDF):
- Learning curves can tell you whether you appear to suffer from high bias or high variance.
- You can base your next step on what you found using the learning curves:
I think it’s nice to have these kinds of rules of thumb when you get stuck, and I hope to follow up this post pretty soon with another one that deals with a relatively recent paper which suggests some neat ways to investigate a classification problem using sets of classification models.