Open Access Publications from the University of California

## Fast and Effective Approximations for Summarization and Categorization of Very Large Text Corpora

• Author(s): Godbehere, Andrew B.
Specifically, minimizing the probability of large deviations of a linear regression model while assuming a $k$-class probabilistic text model yields a $k$-dimensional optimization problem, where $k$ can be much smaller than either the number of documents or the number of features. Further, a simple non-negativity constraint on the problem yields a sparse result without the need for $\ell_1$ regularization. The problem is also analyzed under uncertainty in the model parameters. To estimate such probabilistic text models, a fast implementation of Sparse Principal Component Analysis is investigated and compared with Latent Dirichlet Allocation, and methods of fitting topic models to a dataset are discussed. Examples on a variety of text datasets demonstrate the efficacy of the proposed methods.
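The observation that a non-negativity constraint alone can produce sparsity can be illustrated on a toy problem. The sketch below is not the formulation used in the thesis: it assumes a generic small quadratic objective $\tfrac{1}{2} w^\top Q w - c^\top w$ with $w \ge 0$, and the matrix `Q`, vector `c`, step size, and function name `projected_gradient_nonneg` are all made up for illustration. It uses plain projected gradient descent, chosen only because it is short and needs no external libraries.

```python
def projected_gradient_nonneg(Q, c, step=0.25, iters=200):
    """Minimize 0.5 * w'Qw - c'w subject to w >= 0.

    Illustrative only: Q must be symmetric positive definite and
    `step` must be small enough for gradient descent to converge.
    """
    k = len(c)
    w = [0.0] * k
    for _ in range(iters):
        # Gradient of the quadratic objective: Qw - c
        grad = [sum(Q[i][j] * w[j] for j in range(k)) - c[i] for i in range(k)]
        # Gradient step followed by projection onto the non-negative orthant
        w = [max(0.0, w[i] - step * grad[i]) for i in range(k)]
    return w


# Hypothetical 3-dimensional problem (k = 3); values chosen by hand.
Q = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]
c = [1.0, -1.0, 1.0]

w = projected_gradient_nonneg(Q, c)
print(w)
```

In this toy instance the middle coordinate is clamped to exactly zero by the projection at every step, while the other two converge to 0.5. No $\ell_1$ penalty appears anywhere in the objective; the zero arises purely because the constraint $w \ge 0$ is active at the solution.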