Game-based prediction model selection
In an interesting article, Pers and coauthors describe a kind of game for comparing machine learning algorithms. A recurring problem in studies using machine learning is that researchers initially try out a wide variety of methods, then settle on one and report results only for that algorithm, as if the initial selection had never taken place. The trouble is that data-dependent optimization of the model, including the comparison of different methods, can influence the final model and potentially lead to erroneous conclusions. Model selection is further complicated by the fact that a given researcher may be more skilled at using (parametrizing) a particular type of model, and the methods themselves likely differ considerably in how difficult they are to parametrize.
The authors of this paper introduce a kind of game to facilitate a more objective comparison of machine learning methods:
In this article we present the VAML (Validation and Assessment of Machine Learning) game. The game aims at building a model for individual predictions based on complex data. The game starts by electing a referee who samples a reasonable number of bootstrap subsets or subsamples from the available data. Each player chooses a strategy for building a prediction model. The referee shares out the bootstrap samples and the players apply their strategies and build a prediction model separately in each bootstrap sample. The referee then uses the data not sampled in the respective bootstrap steps and a strictly proper scoring rule (…) to evaluate the predictive performance of the different models. For the interpretation of the results it is most important that all modeling steps are repeated in each bootstrap sample and that the same set of bootstrap samples is used for all strategies. These insights are formulated as fixed rules of the game.
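My reading of the rules in the quote can be sketched roughly as follows. This is a minimal sketch, not the authors' code: the synthetic dataset, the two example "player" strategies, and the choice of the Brier score as the strictly proper scoring rule are all my own illustrative assumptions.

```python
# Sketch of the VAML game loop (my interpretation, not the authors' code).
# Data, strategies and scoring rule are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Each "player" contributes a model-building strategy (a recipe, not a
# fitted model): it is re-applied from scratch in every bootstrap sample.
strategies = {
    "l1 logistic": lambda: LogisticRegression(
        penalty="l1", C=0.5, solver="liblinear"
    ),
    "random forest": lambda: RandomForestClassifier(
        n_estimators=200, random_state=0
    ),
}

# The "referee" draws the bootstrap samples once, so that every strategy
# sees the same resamples and the same left-out (out-of-bag) observations.
n_boot = 20
boot_indices = [rng.integers(0, len(y), len(y)) for _ in range(n_boot)]

scores = {name: [] for name in strategies}
for idx in boot_indices:
    oob = np.setdiff1d(np.arange(len(y)), idx)  # observations not drawn
    for name, make_model in strategies.items():
        model = make_model().fit(X[idx], y[idx])
        p = model.predict_proba(X[oob])[:, 1]
        # The Brier score is a strictly proper scoring rule.
        scores[name].append(brier_score_loss(y[oob], p))

for name, s in scores.items():
    print(f"{name}: mean out-of-bag Brier score {np.mean(s):.3f}")
```

The key rules of the game show up as code structure: the referee fixes `boot_indices` before any modelling happens, and the entire model-building strategy is rerun inside each bootstrap sample rather than once on the full data.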
The idea – as I interpret it – is to squeeze as much as possible out of each method by letting each player tune her own method to the best of her knowledge. This alleviates biases that arise when a researcher is better at certain methods than at others, or (perhaps unconsciously) simply prefers certain methods. Also, the referee probably ensures a more objective comparison than a single modeller would typically make.
It’s a bit funny (and perhaps typical) that in this relatively objective study the best model performed only marginally better than a completely naive baseline. Three well-reputed algorithms – random forests, support vector machines and LASSO regression – were compared, with random forests narrowly “winning”, although it also overfitted the most: its training set error was almost zero, but its prediction error was about the same as that of the other methods and the naive model.
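That overfitting pattern is easy to reproduce on synthetic data. Here is a toy illustration of my own (nothing to do with the paper's data): when the labels are pure noise, a random forest drives its training error toward zero while its out-of-sample Brier score is no better than the naive constant prediction of 0.5.

```python
# Toy illustration (not the paper's data): a random forest memorizes
# pure-noise training labels yet gains nothing out of sample.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 50))
y = rng.integers(0, 2, 300)            # labels carry no signal at all
X_test = rng.normal(size=(300, 50))
y_test = rng.integers(0, 2, 300)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

train_brier = brier_score_loss(y, rf.predict_proba(X)[:, 1])
test_brier = brier_score_loss(y_test, rf.predict_proba(X_test)[:, 1])
# Naive baseline: always predict probability 0.5.
naive_brier = brier_score_loss(y_test, np.full(len(y_test), 0.5))

print(f"train {train_brier:.3f}, test {test_brier:.3f}, naive {naive_brier:.3f}")
```

The training score looks excellent only because fully grown trees memorize their own bootstrap samples; on fresh data the forest cannot do better than the constant baseline, since there is no signal to learn.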