This thesis is a study of boosting. It consists of two parts. In the first part, we develop a new way of parallelizing boosting. In the second part, we apply boosting to the problem of bathymetry data editing and study the issues of experimental design for diverse datasets.
The first part of this thesis presents a parallel boosting algorithm that achieves a significant speedup while keeping a small memory footprint. It combines two novel techniques. One is a method for parallelization with weak synchronization requirements, which we call "Tell Me Something New" (TMSN). The other is a method we call stratified weighted sampling, which significantly reduces the I/O load of boosting.
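As an illustration only, the sketch below shows one way to draw training examples in proportion to their boosting weights using strata. It is not taken from this abstract: the assumption that strata are weight ranges of the form [2^k, 2^(k+1)), and the hypothetical helpers `build_strata` and `sample_one`, are introduced here purely for exposition.

```python
import math
import random
from collections import defaultdict

def build_strata(weights):
    """Group example indices by weight range [2^k, 2^(k+1)) (assumed layout)."""
    strata = defaultdict(list)
    for i, w in enumerate(weights):
        if w > 0:
            strata[math.floor(math.log2(w))].append(i)
    return strata

def sample_one(weights, strata, rng):
    """Draw one index with probability proportional to its weight."""
    keys = list(strata)
    totals = [sum(weights[i] for i in strata[k]) for k in keys]
    # Step 1: pick a stratum in proportion to its total weight.
    k = rng.choices(keys, weights=totals)[0]
    upper = 2.0 ** (k + 1)
    # Step 2: rejection sampling inside the stratum. Weights in one stratum
    # differ by less than a factor of two, so few candidates are rejected.
    while True:
        i = rng.choice(strata[k])
        if rng.random() < weights[i] / upper:
            return i

# Hypothetical weights for four examples.
weights = [1.0, 1.0, 2.0, 4.0]
strata = build_strata(weights)
rng = random.Random(0)
picked = sample_one(weights, strata, rng)
```

Under these assumptions, each example is selected with probability equal to its weight divided by the total weight, while the rejection step only ever compares weights that lie within the same stratum.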
We implemented our algorithm in the Rust programming language and demonstrated its superior performance when memory is limited. Our experiments show a 10-100x speedup over two popular implementations of boosted trees, XGBoost and LightGBM, when the training data are too large to fit in memory.
The second part of this thesis describes a project that uses boosting as an aid in bathymetry data editing. Bathymetry is the study of the depths and shapes of underwater terrain. The objective of our project is to create a binary classifier that separates correct depth measurements from incorrect ones. Our experimental results challenge the standard assumption that training and testing samples are both drawn i.i.d. from a fixed distribution.
First, we examine spurious correlation, where some training and testing samples are similar to each other because they are duplicates, near-duplicates, or sequentially collected. In these cases, a simple memorization-based model can achieve a low in-sample validation error, but its out-of-sample test error is much higher.
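One common way to keep duplicates and sequentially collected samples from appearing on both sides of a split is to split by group rather than by individual sample. The sketch below is a minimal illustration of that idea and is not the procedure used in this thesis; the function `group_split` and the grouping key `survey_line` are hypothetical names introduced for the example.

```python
import random
from collections import defaultdict

def group_split(group_ids, test_fraction=0.25, seed=0):
    """Split sample indices so that no group lands in both train and test."""
    groups = defaultdict(list)
    for idx, gid in enumerate(group_ids):
        groups[gid].append(idx)
    # Shuffle whole groups, not individual samples.
    ordered = sorted(groups)
    random.Random(seed).shuffle(ordered)
    n_test = max(1, round(test_fraction * len(ordered)))
    test = [i for g in ordered[:n_test] for i in groups[g]]
    train = [i for g in ordered[n_test:] for i in groups[g]]
    return train, test

# Hypothetical data: 12 depth measurements collected along 4 survey lines.
survey_line = ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3
train_idx, test_idx = group_split(survey_line)
```

Because entire groups are held out, a model that merely memorizes its training samples gains no advantage from near-identical neighbors in the held-out set.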
Second, we examine insufficient data diversity, where datasets are not diverse enough to be representative. This happens when the feature dimension is so high that collecting a representative sample is difficult. Models trained in these cases perform poorly on a new, separately collected test set because of domain shift.
Lastly, we propose an alternative framework from the perspective of experimental design and present a case study on modeling bathymetry data editing.