eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Subsemble: A Flexible Subset Ensemble Prediction Method

Abstract

Ensemble methods using the same underlying algorithm trained on different subsets of observations have recently received increased attention as practical prediction tools for massive data sets. We propose Subsemble, a general subset ensemble prediction method, which can be used for small, moderate, or large data sets. Subsemble partitions the full data set into subsets of observations, fits one or more user-specified underlying algorithms on each subset, and uses a clever form of V-fold cross-validation to output a prediction function that combines the subset-specific fits through a second user-specified metalearner algorithm. We give an oracle result that provides a theoretical performance guarantee for Subsemble. Through simulations, we demonstrate that Subsembles with randomly created subsets can be beneficial tools for small to moderate-sized data sets, and often have better prediction performance than the same underlying algorithm fit just once on the full data set. We also describe how to include Subsembles as candidates in a SuperLearner library, providing a practical way to evaluate the performance of Subsembles relative to the same underlying algorithm fit just once on the full data set.
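The procedure described above (partition, subset-specific fits, cross-validated combination through a metalearner) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the underlying algorithm is assumed to be a one-covariate least-squares line, the metalearner is assumed to be a linear combination fit with a tiny ridge term for numerical stability, and all function names (`fit_line`, `fit_meta`, `subsemble`) are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Underlying algorithm (assumed): least-squares line in one covariate.

    Assumes the x values are not all identical.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    w = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (m[r][n] - tail) / m[r][r]
    return w


def fit_meta(z, ys, ridge=1e-3):
    """Metalearner (assumed): linear weights on the subset fits, tiny ridge."""
    j = len(z[0])
    a = [[sum(row[p] * row[q] for row in z) + (ridge if p == q else 0.0)
          for q in range(j)] for p in range(j)]
    b = [sum(row[p] * y for row, y in zip(z, ys)) for p in range(j)]
    return solve(a, b)


def subsemble(xs, ys, n_subsets=3, n_folds=5, seed=0):
    """Return a prediction function combining subset-specific fits."""
    n = len(xs)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Random partition into subsets; each subset is split into V folds.
    subset, fold = [0] * n, [0] * n
    for pos, i in enumerate(idx):
        subset[i] = pos % n_subsets
        fold[i] = (pos // n_subsets) % n_folds

    def fit_subset(j, skip_fold=None):
        rows = [i for i in range(n) if subset[i] == j and fold[i] != skip_fold]
        return fit_line([xs[i] for i in rows], [ys[i] for i in rows])

    # Cross-validated predictions: for fold v, every subset-specific fit is
    # trained without fold v and then predicts all observations in fold v.
    z = [None] * n
    for v in range(n_folds):
        fits = [fit_subset(j, skip_fold=v) for j in range(n_subsets)]
        for i in range(n):
            if fold[i] == v:
                z[i] = [f(xs[i]) for f in fits]

    weights = fit_meta(z, ys)
    final_fits = [fit_subset(j) for j in range(n_subsets)]  # full subsets
    return lambda x: sum(w * f(x) for w, f in zip(weights, final_fits))


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.uniform(0, 1) for _ in range(120)]
    ys = [1 + 2 * x + rng.gauss(0, 0.1) for x in xs]
    predict = subsemble(xs, ys)
    print(predict(0.5))
```

The key design point in this sketch is that the metalearner only ever sees predictions for observations that the corresponding subset-specific fits did not train on, which is what lets the combination weights be estimated without overfitting to the subset fits.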

Since the final Subsemble estimator varies depending on the data within each subset, different strategies for creating the subsets used in Subsemble result in different Subsembles, which in turn have different prediction performance. To study the effect of subset creation strategies, we propose supervised partitioning of the covariate space to learn the subsets used in Subsemble. We highlight computational advantages of this approach, discuss applications to large-scale data sets, and develop a practical Supervised Subsemble method that uses regression trees both to create the covariate space partitioning and to select the number of subsets used in Subsemble. Through simulations and real data analysis, we show that this subset creation method can provide better prediction performance than the random subset version.
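To make the idea of supervised partitioning concrete, the following Python sketch splits a single covariate at the threshold that minimises the within-group sum of squared errors of the outcome, which is the split rule of a depth-one regression tree. It is only an illustration under simplifying assumptions: the method described above uses full regression trees and also selects the number of subsets, whereas this sketch always produces at most two subsets, and the names `sse`, `best_split`, and `supervised_subsets` are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys, min_size=2):
    """Threshold on x minimising total within-group SSE of y, or None."""
    pairs = sorted(zip(xs, ys))
    best = None
    for k in range(min_size, len(pairs) - min_size + 1):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical covariate values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, (pairs[k - 1][0] + pairs[k][0]) / 2)
    return None if best is None else best[1]


def supervised_subsets(xs, ys):
    """Assign each observation to subset 0 or 1 using the learned threshold."""
    threshold = best_split(xs, ys)
    if threshold is None:
        return [0] * len(xs), None
    return [0 if x <= threshold else 1 for x in xs], threshold
```

Because the subsets here are regions of the covariate space rather than random draws, a new observation falls into exactly one region, so only the fit for that region would be needed at prediction time. That is one way to read the computational advantage mentioned above, though the sketch itself does not implement the prediction step.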

Finally, we develop the R package subsemble to make the Subsemble method readily available to both researchers and practitioners. We describe the subsemble function, discuss implementation details, and illustrate how to use subsemble for prediction on an example data set.
