Efficient low-rank estimation from sparse social science data
- Author(s): Zhang, Chelsea
- Advisor(s): Sekhon, Jasjeet; Jordan, Michael I; et al.
Social science datasets, notably surveys and panel data, are prone to missingness. Data may be missing by accidents of data collection and nonresponse, or by design as a way to reduce respondent burden. This dissertation focuses on low-rank estimation from sparse datasets with matrix structure. In this unabashedly small-data setting, we study how to improve sample efficiency in several empirical scenarios.
Chapter 1 addresses imputation of missing survey responses. We introduce our approach of applying matrix factorization to the incomplete response matrix and evaluate its frequency properties. To reduce burden, we ask questions selectively, intentionally creating missingness at the individual level. We develop a procedure that optimally designs a short survey given our imputation method. Specifically, we choose questions that maximize information about latent user position. This active strategy reduces the error of imputations in simulations of political surveys and in a Facebook survey experiment. We extend this method to ordinal data, which requires approximate inference but delivers an adaptive question order. Finally, we present evidence that reordering questions in the Facebook survey results in limited bias.
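As an illustration only, the two ideas in Chapter 1 can be sketched in a few lines of Python. The abstract does not specify the estimator or the information criterion, so everything here is an assumption: `als_impute` fits a low-rank factorization by alternating ridge regressions, and `pick_questions` greedily maximizes the log-determinant of the posterior precision of a respondent's latent position under a Gaussian linear response model (a D-optimality criterion). The function names, the `reg` and `noise_var` parameters, and the standard normal prior are hypothetical choices, not the dissertation's.

```python
import numpy as np


def als_impute(R, rank=2, reg=0.1, n_iters=50, seed=0):
    """Fill NaN entries of R using a rank-`rank` factorization R ~ U @ V.T.

    Assumed estimator: alternating ridge regressions over observed entries.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(R)
    n, m = R.shape
    U = rng.normal(scale=0.1, size=(n, rank))  # respondent positions
    V = rng.normal(scale=0.1, size=(m, rank))  # question loadings
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Update each respondent using only the questions they answered.
        for i in range(n):
            Vi = V[observed[i]]
            U[i] = np.linalg.solve(Vi.T @ Vi + ridge, Vi.T @ R[i, observed[i]])
        # Update each question using only the respondents who answered it.
        for j in range(m):
            Uj = U[observed[:, j]]
            V[j] = np.linalg.solve(Uj.T @ Uj + ridge, Uj.T @ R[observed[:, j], j])
    # Keep observed responses; use the low-rank fit only where data are missing.
    return np.where(observed, R, U @ V.T), U, V


def pick_questions(V, k, noise_var=1.0):
    """Greedily choose k questions (rows of V) that are most informative
    about a latent position with an assumed N(0, I) prior.

    Each step adds the question with the largest increase in the
    log-determinant of the posterior precision.
    """
    precision = np.eye(V.shape[1])  # prior precision
    chosen = []
    for _ in range(k):
        best_j, best_gain = None, -np.inf
        for j in range(V.shape[0]):
            if j in chosen:
                continue
            v = V[j]
            # Matrix determinant lemma: gain = log(1 + v' P^{-1} v / noise_var).
            gain = np.log1p(v @ np.linalg.solve(precision, v) / noise_var)
            if gain > best_gain:
                best_j, best_gain = j, gain
        chosen.append(best_j)
        precision = precision + np.outer(V[best_j], V[best_j]) / noise_var
    return chosen
```

In this sketch, a question whose loading nearly duplicates one already asked adds little information, so the greedy rule tends to spread the selected questions across latent dimensions. Under the Gaussian assumption the selection does not depend on the answers received; the adaptive question order mentioned for ordinal data would require a different, response-dependent criterion.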
Subsequent chapters do not optimize survey design, but instead seek to harness additional information by changing the model specification. In Chapter 2, we consider estimating opinion at the subgroup level, in particular state-level opinion on political issues. Small-area estimation methods, such as multilevel regression and poststratification, typically model a single response variable. We borrow strength across related questions via a latent factor model, which performs multilevel regression in the latent space. Our joint model allows for missingness at the group level. We simulate a survey that asks all questions of interest in four states while asking just four pilot questions in all other states. For many non-pilot questions, the joint model's estimates of state opinion have shorter intervals and smaller error. However, if responses are not sparse, a univariate baseline is preferable.
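For readers unfamiliar with the poststratification step mentioned above, a minimal sketch follows. It assumes that some model has already produced an opinion estimate for each demographic cell within a state and that population counts per cell are available; the function name and inputs are hypothetical, and the sketch shows only the standard population-weighted average, not the joint latent factor model.

```python
import numpy as np


def poststratify(cell_estimates, cell_counts):
    """Combine cell-level estimates into one state-level estimate,
    weighting each cell by its population count."""
    cell_estimates = np.asarray(cell_estimates, dtype=float)
    cell_counts = np.asarray(cell_counts, dtype=float)
    return float(cell_estimates @ cell_counts / cell_counts.sum())
```

For example, two cells with estimated support 0.2 and 0.8 and populations 300 and 100 give a state estimate of (0.2 × 300 + 0.8 × 100) / 400 = 0.35.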
In panel datasets, where observations are across time rather than across questions, missingness arises from differences in the availability and frequency of data series. Latent factor methods are popular for denoising, dimension reduction and forecasting; for incomplete panels they also provide imputation. Chapter 3 reviews the econometric literature on low-rank methods for panel data, emphasizing the setting with sparsity. The time dimension endows these datasets with additional structure, and several approaches exist for modeling serial correlation. There is evidence that such dynamic approaches yield efficiency gains in small or sparse samples. A surprising number of methods, however, are static. These studies make other contributions, such as inferential theory, error bounds and bias correction.
Throughout this dissertation we focus on social science applications, but missing data, whether for individuals, for groups or across time, is a universal concern. Low-rank methods that make efficient use of sparse data are broadly applicable.