Tree algorithms for larger-than-memory data sets
We came across a couple of interesting papers about decision tree algorithms for data sets that don’t fit into memory. More generally, we have been thinking lately about this type of approximate data analysis, where various tricks are used to summarize the relevant characteristics of a huge data set. The Mining of Massive Datasets book (free PDF) has some nice material to chew over when it comes to approximate “streaming” algorithms (e.g. chapter 4).
The first of the “streaming decision tree” methods came to our attention via a post on the Revolution Analytics blog, rxDTree(): a new type of tree algorithm for big data. It links to a 2010 paper, A streaming parallel decision tree algorithm (note: PDF link) by Ben-Haim and Yom-Tov. Determining split points in a decision tree normally requires sorting a very large number of values, which becomes non-trivial for extremely big data sets; their algorithm sidesteps this by computing compressed histograms from parts of the data in parallel and merging the histograms before building the tree.
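The core data structure in the Ben-Haim and Yom-Tov paper is a fixed-size histogram that can absorb new values one at a time and be merged with other histograms. Here is a minimal sketch of that idea in Python: add each incoming value as a tiny bin, then repeatedly fuse the two closest bins until the size bound holds. The class and method names are our own, and the paper additionally defines sum/uniform procedures (for estimating counts and candidate split points) that we omit here.

```python
class StreamingHistogram:
    """Fixed-size histogram sketch in the spirit of Ben-Haim & Yom-Tov (2010)."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of (centroid, count) pairs

    def update(self, value):
        # Insert the new value as a singleton bin, then compress.
        self.bins.append((value, 1))
        self.bins.sort()
        self._compress()

    def merge(self, other):
        # Merging two histograms is just pooling their bins and compressing,
        # which is what makes the parallel "build per partition, then merge"
        # scheme work.
        self.bins = sorted(self.bins + other.bins)
        self._compress()

    def _compress(self):
        # Repeatedly fuse the two closest adjacent bins (weighted average of
        # centroids, summed counts) until at most max_bins remain.
        while len(self.bins) > self.max_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda j: self.bins[j + 1][0] - self.bins[j][0])
            (p1, m1), (p2, m2) = self.bins[i], self.bins[i + 1]
            fused = ((p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2)
            self.bins[i:i + 2] = [fused]
```

Because merging is associative, each worker can summarize its own partition of the data and the driver only ever sees a handful of small histograms, which is exactly what makes split-point selection feasible without a global sort.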
The other paper is Ensembles on Random Patches (also PDF link) by Louppe and Geurts, which reminds us of the random forest algorithm in that it uses subsets of both instances and features to build an ensemble model, but in contrast to RF it does this in order to reduce the memory footprint. The abstract of the paper:
In this paper, we consider supervised learning under the assumption that the available memory is small compared to the dataset size. This general framework is relevant in the context of big data, distributed databases and embedded systems. We investigate a very simple, yet effective, ensemble framework that builds each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset. We carry out an extensive and systematic evaluation of this method on 29 datasets, using decision tree-based estimators. With respect to popular ensemble methods, these experiments show that the proposed method provides on par performance in terms of accuracy while simultaneously lowering the memory needs, and attains significantly better performance when memory is severely constrained.
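The mechanics of the random-patches idea are easy to sketch: each base model only ever sees a random rectangle of the data (a subset of rows crossed with a subset of columns), so only that patch needs to be in memory at a time. Below is a toy Python illustration using decision stumps as the base learner; note this is our own simplification for brevity (the paper grows full decision trees on each patch), and the function names and patch-size defaults are assumptions of the sketch.

```python
import random

def fit_stump(X, y, features):
    # Exhaustive search over the allowed features for the (feature, threshold)
    # split with the lowest 0/1 error on the patch.
    best = None  # (error, feature, threshold)
    for f in features:
        vals = sorted(set(row[f] for row in X))
        thresholds = [(a + b) / 2 for a, b in zip(vals, vals[1:])] or vals
        for t in thresholds:
            err = sum((row[f] > t) != bool(label) for row, label in zip(X, y))
            if best is None or err < best[0]:
                best = (err, f, t)
    return best[1], best[2]

def predict_stump(stump, row):
    f, t = stump
    return 1 if row[f] > t else 0

def random_patches(X, y, n_models=25, row_frac=0.5, col_frac=0.5, seed=0):
    # Each base model is trained on a random patch: a random subset of
    # instances (rows) crossed with a random subset of features (columns).
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    models = []
    for _ in range(n_models):
        rows = rng.sample(range(n), max(1, int(row_frac * n)))
        cols = rng.sample(range(d), max(1, int(col_frac * d)))
        patch_X = [X[i] for i in rows]
        patch_y = [y[i] for i in rows]
        models.append(fit_stump(patch_X, patch_y, cols))
    return models

def predict(models, row):
    # Majority vote across the ensemble, as in bagging-style methods.
    votes = sum(predict_stump(m, row) for m in models)
    return 1 if votes * 2 >= len(models) else 0
```

The memory saving is the point: with row_frac and col_frac both at 0.5, each model is fit on a quarter of the data, and the fractions can be pushed much lower when memory is severely constrained, which is the regime where the paper reports the method pulling ahead.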