Essays on Classification, Variable Selection and Statistical Inference
- Author(s): Chu, Jianghao
- Advisor(s): Ullah, Aman; Lee, Tae-Hwy; et al.
This dissertation covers topics in classification with high-dimensional data, variable selection in sparse semiparametric single-index models and statistical inference under heteroskedasticity. In particular, Chapter 1 provides the motivation and background of the dissertation.
Chapter 2 provides a summary of boosting methods for classification, namely Discrete AdaBoost, Real AdaBoost, P-AdaBoost, Gentle AdaBoost and LogitBoost. We compare these methods with alternative machine learning classification tools such as deep neural networks and demonstrate empirical applications in economics, such as predicting business cycle turning points and the direction of stock price indexes.
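As an illustration of the simplest method in this family, the following is a minimal sketch of the textbook Discrete AdaBoost with decision stumps as weak learners. It is not taken from the dissertation: the function names (`fit_stump`, `stump_predict`, `discrete_adaboost`, `predict`), the choice of stumps, and the label coding in {-1, +1} are all assumptions made for the example.

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively search for the decision stump with the lowest weighted error.

    Returns (weighted_error, feature_index, threshold, sign).
    """
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, -s, s)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def stump_predict(X, j, t, s):
    """Predict -s when feature j is at or below threshold t, and s otherwise."""
    return np.where(X[:, j] <= t, -s, s)

def discrete_adaboost(X, y, n_rounds=20):
    """Textbook Discrete AdaBoost; labels y must be coded as -1 or +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)  # start from uniform observation weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, t, s = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak learner
        pred = stump_predict(X, j, t, s)
        w = w * np.exp(-alpha * y * pred)  # up-weight misclassified points
        w = w / w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, X):
    """Classify by the sign of the weighted vote of all stumps."""
    score = sum(a * stump_predict(X, j, t, s) for a, j, t, s in ensemble)
    return np.where(score >= 0, 1, -1)
```

A call such as `predict(discrete_adaboost(X, y), X_new)` returns labels in {-1, +1}. The exhaustive stump search keeps the sketch short but is slow for large samples.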
Chapter 3 generalizes the Discrete AdaBoost presented in Chapter 2 to binary classification problems with state-dependent loss functions. We introduce Asymmetric AdaBoost, which solves the asymmetric maximum score problem with high-dimensional data. Asymmetric AdaBoost produces a nonparametric classifier by minimizing the "asymmetric exponential risk", which is a convex surrogate of the traditional non-convex score risk or 0-1 risk. The convex risk function offers a substantial computational advantage over non-convex risk functions, e.g. Maximum Score (Manski, 1975, 1985), especially when the data is high-dimensional.
Chapter 4 considers the "Regularization of Derivative Expectation Operator" (RODEO) of Lafferty and Wasserman (2008) and proposes a modified RODEO algorithm for sparse semiparametric single-index models, which we call the SIM-RODEO. The SIM-RODEO method is able to distinguish relevant explanatory variables from irrelevant variables and gives a competitive estimator for the model.
In addition, the algorithm finishes in a reasonable period of time.
Chapter 5 investigates methods for statistical inference in the presence of heteroskedasticity of unknown form in the disturbances of linear regression models. We propose an F-type test statistic for testing regression parameters under heteroskedasticity of unknown form. The accuracy of the test statistic is confirmed by extensive Monte Carlo experiments. Chapter 6 concludes.
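For context on the inference problem in Chapter 5, the sketch below computes a standard heteroskedasticity-robust Wald statistic, scaled by the number of restrictions, using White's HC0 covariance estimator. This is a common baseline and is not the F-type statistic proposed in the dissertation, whose form is not given here. The function name `robust_wald_f` and its arguments are hypothetical.

```python
import numpy as np

def robust_wald_f(X, y, R, r):
    """OLS with White (HC0) covariance and a Wald test of H0: R @ beta = r.

    Returns (beta_hat, F), where F is the Wald statistic divided by the
    number of restrictions. This is a baseline, not the dissertation's test.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y                # OLS estimate
    e = y - X @ beta                        # residuals
    meat = (X * (e ** 2)[:, None]).T @ X    # sum of e_i^2 * x_i x_i'
    V = XtX_inv @ meat @ XtX_inv            # HC0 sandwich covariance
    d = R @ beta - r
    wald = d @ np.linalg.solve(R @ V @ R.T, d)
    return beta, wald / R.shape[0]
```

Here `X` is the n-by-k regressor matrix (including a constant column if desired), `R` is a q-by-k restriction matrix and `r` a length-q vector. In large samples the unscaled Wald statistic is conventionally compared with a chi-squared distribution with q degrees of freedom.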