This dissertation considers prediction and variable selection for the applied political scientist, particularly in the context of high-dimensional data.
In Chapter 1, we consider the puzzle of why highly significant variables aren't automatically good predictors. This problem occurs in both simple and complex data. We offer explanations and statistical insights into the puzzle. We suggest shifting the research agenda toward searching for a criterion that locates highly predictive variables rather than highly significant ones. We offer an alternative approach, the Partition Retention (PR) method, which reduced prediction error from 30% to 8% on a long-studied breast cancer data set.
In Chapter 2, we propose approaching prediction from a framework grounded in finding the correct prediction rates of variables. Although this framing is intuitively obvious, not nearly enough attention has been paid to creating a clear theoretical framework for prediction. We present an objective function for prediction rates. We consider, but ultimately reject, an estimator based on the sample analog of its solution, because the sample analog cannot distinguish predictive variables from noisy ones and therefore cannot be usefully estimated. We offer an alternative solution and demonstrate that the PR method's I-score asymptotically approaches this alternative solution. The I-score for a variable set can be written as an asymptotic lower bound for the correct prediction rate. We offer simulations and applications of the I-score to real data.
In Chapter 3, I propose a new approach to predicting civil war onsets that emphasizes variable selection. A good variable selection approach should search for variables based on a criterion of predictivity and should find variable interactions. I suggest the PR method for variable selection and illustrate it with simulations and an application to civil war data, comparing the results with those of alternative approaches. The PR method identifies variable sets, some as large as 5 or 6 variables, that predict war onsets. Predicting with these variable sets raises correct prediction rates on out-of-sample data from 78.98% to 98.05%. The application demonstrates the gains in prediction rates for political phenomena like civil wars that come from including a research step for variable selection.