Preventing Multiple Comparisons Problems in Data Exploration and Machine Learning
- Author(s): Koulouris, Nikolaos
- Advisor(s): Papakonstantinou, Yannis
- et al.
More data gives a researcher more opportunities to test hypotheses until an interesting finding turns up. This increases the probability of arriving at a false conclusion purely by chance, a phenomenon known as the multiple comparisons problem. Data exploration systems facilitate the exploration of big data by automatically testing thousands of hypotheses to find the most interesting ones. In machine learning, analysts repeatedly test models on a holdout dataset until they find the one with the best performance. Auto-ML systems try to automate this model selection process. In both cases, testing more hypotheses means a higher probability of reporting a result that arose purely by chance.
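The growth of this risk can be made concrete with a short calculation. The sketch below is illustrative only and is not taken from the dissertation: it assumes independent tests whose null hypotheses are all true, a per-test significance level of 0.05, and hypothetical function names. It also shows the classical Bonferroni correction, used here purely as a familiar reference point.

```python
def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among `num_tests`
    independent tests, assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_alpha(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level under the Bonferroni correction,
    which keeps the family-wise error rate at or below `alpha`."""
    return alpha / num_tests


if __name__ == "__main__":
    for m in (1, 10, 100, 1000):
        uncorrected = family_wise_error_rate(m)
        corrected = family_wise_error_rate(m, bonferroni_alpha(m))
        print(f"{m:5d} tests: uncorrected={uncorrected:.4f}, Bonferroni={corrected:.4f}")
```

Under these assumptions, the chance of at least one spurious finding is already about 40% after ten tests, which is why systems that test thousands of hypotheses need an explicit control procedure.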
This dissertation examines how the multiple comparisons problem appears in the fields of data exploration and machine learning. In both cases, we propose techniques that exploit structure inherent to the field to improve upon existing techniques and reduce the consequences of multiple comparisons.
We present VigilaDE, the first data exploration system that utilizes the hierarchical structure of the data in order to control false discoveries. Many real-world datasets already have domain-specific hierarchies that describe the relationships between variables. VigilaDE uses these hierarchies to guide the exploration towards interesting discoveries, which increases statistical power while still controlling false discoveries. Through extensive experiments with real-world data, simulations, and theoretical analysis, we show that our data exploration algorithms can find up to 2.7x more true discoveries than the baseline while controlling the number of false discoveries.
In machine learning, the consequence of testing many different models is overfitting. We present an experimental analysis of ThresholdOut, the state-of-the-art algorithm for avoiding overfitting to a holdout dataset. The main limitation of ThresholdOut is the difficulty of setting its parameters. We present Auto-Set, an automated way to set these parameters for feature selection. In feature selection specifically, the sequence of models tested on a holdout dataset has a very specific structure. We exploit this structure in Auto Adjust Threshold, a novel feature selection algorithm that avoids overfitting to a holdout dataset, and show that it outperforms existing algorithms.
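To illustrate why parameter setting matters, the following is a simplified sketch of a single query in the spirit of ThresholdOut. It is not the implementation analyzed in the dissertation: the function names, the default values for `threshold` and `sigma`, the single noise scale, and the omission of a query budget are all assumptions made for brevity.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two
    independent exponentials with mean `scale` (scale must be > 0)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def thresholdout_query(train_value: float,
                       holdout_value: float,
                       rng: random.Random,
                       threshold: float = 0.04,  # illustrative default (assumption)
                       sigma: float = 0.01       # illustrative default (assumption)
                       ) -> float:
    """Answer one query about a model statistic (e.g. accuracy).

    If the training and holdout estimates agree up to a noisy threshold,
    return the training estimate, so the holdout set leaks nothing new.
    Otherwise return the holdout estimate with noise added.
    """
    noisy_threshold = threshold + laplace_noise(sigma, rng)
    if abs(holdout_value - train_value) > noisy_threshold:
        return holdout_value + laplace_noise(sigma, rng)
    return train_value


if __name__ == "__main__":
    rng = random.Random(0)
    # The estimates agree closely, so the training value is very likely returned.
    print(thresholdout_query(0.81, 0.80, rng))
    # A large gap suggests overfitting, so a noisy holdout value is very likely returned.
    print(thresholdout_query(0.95, 0.70, rng))
```

In this sketch, a `threshold` that is too large hides real overfitting, while one that is too small, or a `sigma` that is too large, makes the answers noisy or exposes the holdout set too often. That trade-off is the kind of tuning difficulty an automated parameter setting aims to remove.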