This dissertation uses boosting, a machine learning technique, for causal inference in instrumental variable (IV) regression models.
In Chapter 1, when endogenous variables are approximated by sieve functions of observable instruments, the number of instruments grows rapidly, and many of them may be invalid or irrelevant. We introduce Double Boosting (DB), which consistently selects only valid and relevant instruments even when the number of instruments exceeds the sample size. We then estimate the parameter of interest by the generalized method of moments (GMM) using the selected instruments, and we refer to this method as Double Boosting GMM (DB-GMM). We show that DB does not select weakly relevant or weakly valid instruments. In Monte Carlo simulations, we compare DB-GMM with other methods such as GMM with a Lasso penalty (penalized GMM). In an application to estimating a BLP-type automobile demand function, where price is endogenous and the instruments are high-dimensional functions of product characteristics, we find that the DB-GMM estimator implies a more price-elastic demand than other estimators.
Extending Chapter 1, Chapter 2 combines the DB selection algorithm with multilayer neural networks (NN) for the first-stage IV estimation, where the high-dimensional sieve instruments are the activation functions of the last hidden layer of the network.
Chapter 3 studies panel data models with many instruments. When the regressors in a panel data model are endogenous, we apply two-stage least squares (2SLS) to the fixed effects (FE) estimator and denote the result as FE-2SLS. We find that the FE-2SLS estimator is sensitive to the number of instruments: it becomes inconsistent as the number of instruments increases. We show that using either of two regularization methods, SCAD or L2Boosting, for instrument selection makes the FE-2SLS estimator more robust and restores its consistency when there are many instruments. Furthermore, we consider a Stein-like combined estimator of the FE and FE-2SLS estimators and provide its asymptotic properties. An empirical study of the economics of real house prices is conducted using US state-level panel data.
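To make the Chapter 3 idea concrete, the following is a minimal sketch, not the dissertation's implementation, of instrument selection by componentwise L2Boosting followed by FE-2SLS with one endogenous regressor. The function names (within_transform, l2boost_select, fe_2sls), the fixed number of boosting steps, the step size nu, and the simulated data are all assumptions made for illustration; the abstract does not state the stopping rule or tuning choices actually used.

```python
import numpy as np

def within_transform(a, ids):
    """Subtract each unit's time mean (the fixed-effects 'within' transformation)."""
    a = np.asarray(a, dtype=float)
    out = a.copy()
    for g in np.unique(ids):
        mask = ids == g
        out[mask] -= a[mask].mean(axis=0)
    return out

def l2boost_select(Z, x, n_steps=30, nu=0.1):
    """Componentwise L2Boosting of x on the columns of Z.

    Returns the sorted indices of the columns chosen at least once.
    n_steps and nu are illustrative defaults (assumptions), not tuned values.
    """
    resid = x - x.mean()
    norms = (Z ** 2).sum(axis=0)
    selected = set()
    for _ in range(n_steps):
        coefs = Z.T @ resid / norms      # least-squares slope of the residual on each column
        gains = coefs ** 2 * norms       # drop in residual sum of squares for each column
        j = int(np.argmax(gains))        # best-fitting single instrument at this step
        resid = resid - nu * coefs[j] * Z[:, j]   # shrunken update of the residual
        selected.add(j)
    return sorted(selected)

def fe_2sls(y, x, Z, ids, cols):
    """FE-2SLS for a single endogenous regressor, using only the instrument columns in cols."""
    yw = within_transform(y, ids)
    xw = within_transform(x, ids)
    Zw = within_transform(Z[:, cols], ids)
    xhat = Zw @ np.linalg.lstsq(Zw, xw, rcond=None)[0]   # first-stage fitted values
    return float(xhat @ yw / (xhat @ xw))                # second-stage IV estimate

# Simulated panel (purely illustrative): 100 units, 5 periods, 50 candidate
# instruments of which only the first three are relevant; the true slope is 1.0.
rng = np.random.default_rng(0)
N, T, p = 100, 5, 50
ids = np.repeat(np.arange(N), T)
alpha = rng.normal(size=N)[ids]              # unit fixed effects
Z = rng.normal(size=(N * T, p))
v = rng.normal(size=N * T)
x = Z[:, :3].sum(axis=1) + alpha + v         # endogenous regressor
u = 0.8 * v + rng.normal(size=N * T)         # error correlated with x through v
y = 1.0 * x + alpha + u

selected = l2boost_select(within_transform(Z, ids), within_transform(x, ids))
beta_hat = fe_2sls(y, x, Z, ids, selected)
print(selected, beta_hat)
```

Selection is carried out on the within-transformed data so that the fixed effects are removed before the instruments are ranked. A data-driven stopping rule would normally replace the fixed number of steps used here.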