Random forests (RFs) and generalized linear models (GLMs), the latter often in some regularized form (ridge regression, the lasso, the elastic net), are currently popular predictive modelling methods, and with good reason. At some conference – I think it was Strata 2012 – there was even a special session on “the only two predictive modelling techniques you will need”, which were these two. They tend to be accurate (random forests in particular have this reputation), and both can quantify the importance of different features, although the regression models have an edge over random forests here, as the connection between regression coefficients and the prediction is quite natural.
Earlier this year, a group from UCLA published a paper in which they combine RFs and GLMs into random generalized linear models (RGLMs), which they claim yields classifiers with both extremely high accuracy and very good interpretability.
This seems like a natural idea, so why has no one come up with it before? The authors explain in an instructive presentation that in this case, “two wrongs make a right”: they have combined two ideas that are both bad in themselves, but that somehow work well in combination.
Let’s backtrack a little and see how an RGLM works. At the bottom, there is a “standard” generalized linear model, which is repeatedly fitted to different bootstrap samples (samples drawn with replacement) from the training set; in other words, the GLMs are “bagged”. But as in random forests, each model is trained not only on a random (overlapping) subset of the training examples, but also on a random subset of the features. A fancier term for the latter is “random subspace projection.”
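To make the bagging-plus-random-subspace step concrete, here is a minimal sketch in Python (my own toy illustration with scikit-learn, not the authors' R implementation; the bag count and subset size are arbitrary choices):

```python
# Toy illustration (not the authors' implementation): bag logistic regressions,
# each fitted to a bootstrap sample of the rows and a random subset of columns.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

n_bags, n_feats = 50, 8
models = []
for _ in range(n_bags):
    rows = rng.integers(0, len(X), size=len(X))                 # bootstrap: sample rows with replacement
    cols = rng.choice(X.shape[1], size=n_feats, replace=False)  # random feature subset
    m = LogisticRegression(max_iter=1000).fit(X[np.ix_(rows, cols)], y[rows])
    models.append((m, cols))

# Combine the bagged GLMs by majority vote
votes = np.mean([m.predict(X[:, cols]) for m, cols in models], axis=0)
acc = ((votes >= 0.5).astype(int) == y).mean()
```

Each model only ever sees its own bootstrap rows and feature columns, which is what makes the individual GLMs diverse.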
So far, it’s “just” a GLM trained in the same way as a random forest. But there is an additional twist: each GLM is built up using forward selection of variables. Also, feature interactions are allowed in the GLMs.
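Forward selection itself can be sketched like this (again a toy Python version; the stopping rule and the training log-loss criterion are my simplifications, not the paper's exact procedure):

```python
# Toy forward selection (my simplification, not the paper's procedure):
# greedily add the feature that most reduces training log-loss, and stop
# when the improvement becomes negligible.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=200, n_features=10, random_state=1)

selected, remaining = [], list(range(X.shape[1]))
best_loss = np.inf
while remaining:
    losses = {}
    for j in remaining:
        cols = selected + [j]
        m = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
        losses[j] = log_loss(y, m.predict_proba(X[:, cols])[:, 1])
    j_best = min(losses, key=losses.get)
    if losses[j_best] >= best_loss - 1e-4:   # no real improvement: stop
        break
    best_loss = losses[j_best]
    selected.append(j_best)
    remaining.remove(j_best)
```

Because the greedy search chases the training fit, small perturbations of the data can change which features get picked first, which is exactly the instability discussed below.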
Now we can go back to the “two wrongs make a right” claim. The authors argue that “forward selection is typically a bad idea since it overfits the data and thus degrades the prediction accuracy of a single GLM predictor” (page 18 in the paper), and that bagging a full logistic regression model (a form of GLM) is “also a bad idea since it leads to a complicated (ensemble) predictor without clear evidence for increased accuracy”. Nevertheless, these ideas in combination (plus the random feature subset selection) apparently yield a superior classifier because – if I understand the authors correctly – the instability introduced by forward selection of variables and by random feature subsetting is exactly what bagging thrives on: bagging reduces variance most when the base classifiers are unstable.
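Putting the ingredients together, a rough toy version of the combined scheme might look like this (my own sketch, not the randomGLM code; the subset size and the cap on selected features are arbitrary choices, and I use training log-loss as the selection criterion):

```python
# Rough toy version of the combined scheme (not the randomGLM code): each bag
# gets a bootstrap sample and a random feature subset, and a small GLM is then
# built by greedy forward selection within that subset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(2)
X, y = make_classification(n_samples=300, n_features=20, random_state=2)

def forward_select(Xs, ys, max_feats=3):
    """Greedily pick up to max_feats columns by training log-loss."""
    selected, remaining = [], list(range(Xs.shape[1]))
    while remaining and len(selected) < max_feats:
        def loss(j):
            cols = selected + [j]
            m = LogisticRegression(max_iter=1000).fit(Xs[:, cols], ys)
            return log_loss(ys, m.predict_proba(Xs[:, cols])[:, 1])
        j_best = min(remaining, key=loss)
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

ensemble = []
for _ in range(30):
    rows = rng.integers(0, len(X), size=len(X))             # bootstrap sample
    cols = rng.choice(X.shape[1], size=8, replace=False)    # random subspace
    picked = forward_select(X[np.ix_(rows, cols)], y[rows])
    feats = cols[picked]                                    # map back to original columns
    m = LogisticRegression(max_iter=1000).fit(X[np.ix_(rows, feats)], y[rows])
    ensemble.append((m, feats))

votes = np.mean([m.predict(X[:, f]) for m, f in ensemble], axis=0)
acc = ((votes >= 0.5).astype(int) == y).mean()
```

Each base model ends up tiny (here at most three features), yet the majority vote over many such unstable models can be quite accurate, which is the intuition behind the authors' claim.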
The authors show that RGLMs have excellent prediction performance on a range of data sets (notably, dozens of high-dimensional genomic data sets), and I am very eager to try this method, which is available as an R package, randomGLM (see http://labs.genetics.ucla.edu/horvath/htdocs/RGLM/ for more).
I am not sure I fully buy the claim that the RGLM is very interpretable – because it is an ensemble method, you do not get anything as straightforwardly interpretable as a regression coefficient for each variable. What you do get is (in many cases) a very simple model: RGLM uses “predictor thinning”, based on a variable importance measure, to arrive at a slimmed-down set of predictors that is easier to interpret by virtue of its compactness. In a way, I guess this is similar to the lasso, where one hopes for a small set of variables with non-zero coefficients.
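To illustrate the lasso analogy (my example, not from the paper): an L1 penalty drives most coefficients exactly to zero, leaving a compact model:

```python
# My example (not from the paper): the lasso's L1 penalty zeroes out most
# coefficients, so only a small set of features remains in the model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       random_state=0)
coef = Lasso(alpha=1.0).fit(X, y).coef_
n_nonzero = int(np.sum(coef != 0))   # typically far fewer than 20
```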
I haven’t discussed how the RGLM combines evidence from its sub-classifiers or how the variable importance is calculated – it’s all in the paper, and I need to go to bed!