Recent advances in data collection and processing have introduced many challenges for modeling the resulting data. Data sets have exploded in both the number of observations and dimensionality. The explosion in dimensionality has led to advances in the modeling of high dimensional data with regularized and sparse models. Sparse data sets are among the more interesting and challenging varieties of high dimensional data. They arise in many important areas involving human-computer interaction, such as text and language processing, and human-human interaction, such as social networks.
Our motivation in this thesis is to explore the use of sparse models for applications involving sparse data. In some cases, we have made improvements over previous methods that fundamentally involved dense models fitted on, and applied to, sparse data. In other cases, we have adapted sparse models developed for dense data sets. Along the way, we have encountered a recurring issue: because of both subsampling and regularization, sparse models may fail to capture the full dimensionality of such data and may therefore be inadequate for prediction on test data.
The utility of sparse models has been demonstrated in contexts with very high dimensional dense data. In this dissertation, we examine two applications and modeling methods involving sparse linear models and sparse matrix decompositions. Our first application involves natural language processing and ranking; the second involves recommendation systems and matrix factorization.
In Chapter 2, we developed a novel and powerful visualization system. We named our system Bonsai as it enables a curated process of developing trees that partition the joint space of data and models. By exploring the product of three spaces (the training data, the modeling parameters, and the test data), we can examine how our models develop under the constraints imposed on them and the data they attempt to model or predict. More generally, we believe we have introduced a very fruitful means of exploring a multiplicity of models and a multiplicity of data samples.
Chapter 3 is based on our work in the Netflix Prize competition. In contrast to others' use of dense models for this sparse data, we sought to introduce modeling methods with tunable sparsity. In this work, we found striking difficulties in modeling the data with sparse models, and identified concerns about the utility of sparse models for sparse data.
In conclusion, this thesis presents several methods for modeling sparse data with sparse models, along with the limitations of those methods. These limitations suggest new directions to pursue. In particular, we are optimistic that future research in modeling methods may find new ways to tune models for density when applied to sparse data, just as much research on models for dense data has involved tuning models for sparsity.