This dissertation concerns high-dimensional data and their regularization through dimension reduction and penalization.
We start with two real-world problems that illustrate the practical difficulties of analyzing high-dimensional data, along with their remedies. In Chapter 1, we are tasked with modeling and predicting the U.S. stock market, where the number of stocks far exceeds the number of days relevant to the current market. Working within an existing statistical arbitrage framework, we reduce the dimension of the problem using correspondence analysis. We develop a data-driven regression model and highlight some common statistical methods that improve our predictions. In Chapter 2, we attempt to detect and predict system anomalies in large enterprise telephony systems. We do this by processing large volumes of unstructured log files, again with dimension reduction methods, which allows effective visualization and automatic filtering of the results.
We then move on to more general methodology and analysis in high dimensions.
In Chapter 3, we consider regularization methods, which are often used with high-dimensional data, and tackle the problem of selecting the associated regularization parameter. We introduce SSCV, a selection criterion based on statistical stability that also incorporates model fit, and show that it can often outperform the popular cross-validation. Finally, in Chapter 4, we explore robust methods in the high-dimensional setting. We focus on the relative performance and distributional robustness of the estimators that optimize the L1 and L2 loss functions, respectively. We verify some expected results and also highlight cases where results from classical asymptotics fail, setting the stage for future theoretical work.