UC San Diego
Learning from data with low intrinsic dimension
- Author(s): Verma, Nakul
- Verma, Nakul
- et al.
The information explosion of the past few decades has created tremendous opportunities for Machine Learning- based data analysis. Modern data typically possesses a large number of features. Consider the task of predicting the effectiveness of a particular treatment by analyzing a patient's genome. One hopes that by measuring several gene expression levels one can capture relevant information, leading to better predictions. However, the presence of a large number of irrelevant features adds to the statistical and computational complexity of the learning algorithm, without helping the practitioner to solve the task at hand. Indeed, conventional statistical wisdom suggests that in a general setting the learning task becomes significantly more difficult with an increase in the number of features, making it especially difficult to design and analyze learning algorithms for modern, high- dimensional data. This dissertation explores a specific way one can cope with this curse of dimension. The key observation is that while modern datasets are represented in high dimensions, they often adhere to some low- dimensional intrinsic structure. This intrinsic structure can manifest itself in several forms: some datasets such as text data have a sparse structure; other datasets such as speech and image articulation data follow a manifold structure. If this intrinsic structure is in fact low-dimensional (that is, has few degrees of freedom), then the complexity of learning algorithms should scale only with data's intrinsic dimensionality. In the first part of this dissertation we study how the performance of learning algorithms is affected when the data have a low-dimensional manifold structure. We provide sharp bounds for unsupervised dimensionality reduction, and an improved PAC-learning framework for multiple instance learning in this setting. The second part of this dissertation focuses on understanding and formalizing the general phenomenon of low intrinsic dimension. We explore a few notions that can effectively quantify low- dimensional geometric structure in data. We show that unlike traditional notions, some of the new notions are algorithmically verifiable. We can thus test a given dataset and guarantee a learning rate that scales only with its intrinsic dimension