Center for Bioinformatics and Molecular Biostatistics
Machine Learning Benchmarks and Random Forest Regression
- Author(s): Segal, Mark R; et al.
Breiman (2001a,b) has recently developed an ensemble classification and regression approach that displayed outstanding performance with regard to prediction error on a suite of benchmark datasets. As the base constituents of the ensemble are tree-structured predictors, and since each of these is constructed using an injection of randomness, the method is called ‘random forests’. That the exceptional performance is attained with seemingly only a single tuning parameter, to which sensitivity is minimal, makes the methodology all the more remarkable. The individual trees comprising the forest are all grown to maximal depth. While this helps with regard to bias, there is the familiar tradeoff with variance. However, these variability concerns were potentially obscured because of an interesting feature of the benchmarking datasets extracted from the UCI machine learning repository for testing: all these datasets are hard to overfit using tree-structured methods. This raises issues about the scope of the repository.
With this as motivation, and coupled with experience from boosting methods, we revisit the formulation of random forests and investigate prediction performance on real-world and simulated datasets for which maximally sized trees do overfit. These explorations reveal that gains can be realized by additional tuning to regulate tree size, either by limiting the number of splits or by limiting the size of nodes for which splitting is allowed, or both. Nonetheless, even in these settings, good performance for random forests can be attained by setting the primary tuning parameter to values larger than the default.
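The two kinds of tuning discussed above can be illustrated with a minimal, standard-library sketch of a regression forest. It is not the implementation used in the paper. It assumes that the primary tuning parameter is the number of candidate features sampled at each split (called `mtry` here), and it regulates tree size only through a hypothetical `min_split_size` argument, the smallest node for which splitting is allowed. All function names and the simulated data are illustrative.

```python
import math
import random

def mean(ys):
    return sum(ys) / len(ys)

def sse(ys):
    """Sum of squared errors around the node mean."""
    m = mean(ys)
    return sum((v - m) ** 2 for v in ys)

def grow_tree(X, y, mtry, min_split_size, rng):
    """Grow one regression tree; nodes smaller than min_split_size are not split."""
    if len(y) < min_split_size or len(set(y)) == 1:
        return mean(y)                                # leaf: predict the node mean
    best = None
    for j in rng.sample(range(len(X[0])), mtry):      # random subset of features
        for t in sorted({row[j] for row in X})[:-1]:  # thresholds leaving both sides non-empty
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:                                  # sampled features are all constant
        return mean(y)
    _, j, t, left, right = best
    return (j, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], mtry, min_split_size, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], mtry, min_split_size, rng))

def grow_forest(X, y, n_trees=10, mtry=1, min_split_size=2, seed=0):
    """Average of trees, each grown on a bootstrap sample of the data."""
    rng = random.Random(seed)
    n = len(y)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # bootstrap sample with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx],
                                mtry, min_split_size, rng))
    return forest

def predict_tree(tree, row):
    while isinstance(tree, tuple):                    # internal node: (feature, threshold, left, right)
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def predict_forest(forest, row):
    return mean([predict_tree(tree, row) for tree in forest])

def count_leaves(tree):
    if not isinstance(tree, tuple):
        return 1
    return count_leaves(tree[2]) + count_leaves(tree[3])

# Simulated noisy data (illustrative only, not one of the datasets in the paper).
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(40)]
y = [math.sin(6 * x0) + data_rng.gauss(0, 0.3) for x0, _ in X]

deep = grow_forest(X, y, mtry=2, min_split_size=2)     # trees grown to maximal depth
capped = grow_forest(X, y, mtry=2, min_split_size=10)  # splitting restricted to larger nodes

print("leaves, maximal trees:", sum(count_leaves(t) for t in deep))
print("leaves, capped trees: ", sum(count_leaves(t) for t in capped))
```

In this sketch, raising `min_split_size` stops splitting earlier and so yields smaller trees, while `mtry` controls how many features compete at each split. Comparing the two forests on held-out data, rather than on the training sample, would be the natural way to check for overfitting.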