Analysis of Spectral Clustering and the Lasso under Nonstandard Statistical Models
- Author(s): Rohe, Karl Greiner
- Advisor(s): Yu, Bin
- et al.
Advances in information technology have provided businesses, governments, and the sciences novel tools to collect and store data. These new technologies produce data that does not always fit within the classical statistical paradigm: several independent and identically distributed samples from a distribution with a few parameters. Instead, the modern data sources often contain dependent samples, with several agents that interact, e.g. social networks. In some situations, the number of parameters far exceeds the number of samples, e.g. experiments with fMRI. To create knowledge from these new data sources requires fresh statistical tools and novel theoretical insights. This dissertation uses nonstandard statistical models to study the theoretical properties of three estimators that confront two modern types of data, network data and high dimensional data.
Networks or graphs can easily represent a diverse set of data sources that are characterized by interacting units or actors. Social networks, representing people who communicate with each other, are one example. Communities or clusters of highly connected actors form an essential feature in the structure of several empirical networks. Spectral clustering is a popular and computationally feasible method to discover these communities in undirected networks. Chapter 2 uses the Stochastic Blockmodel [Holland et al., 1983], a social network model with well defined communities, to study the estimation properties of the spectral clustering algorithm. For a network generated from the Stochastic Blockmodel, we bound the number of nodes ``misclustered" by spectral clustering. The asymptotic results in Chapter 2 are the first clustering results that allow the number of clusters in the model to grow with the number of nodes, hence the name high-dimensional.
Chapter 2 addresses clustering in undirected networks, where relationships do not have a direction. However, in some networks, such as the predator-prey network, relationships point from one actor to another (e.g. from predator to prey). Chapter 3 extends the spectral clustering algorithm and the Stochastic Blockmodel to directed graphs. The new algorithm, Di-Sim (pronounced ``dice `em") produces two distinct types of clusters, one corresponding to incoming edges and another corresponding to outgoing edges. The new Directed Stochastic Blockmodel provides a test bed to conceptualize these two types of clusters. This new model also allows for a theoretical investigation of Di-Sim's asymptotic properties. While directed spectral clustering is not derived from this model, it is shown that, asymptotically, the algorithm can estimate the blocks in the directed Stochastic Blockmodel.
Finally, Chapter 4 investigates the sign consistency of the Lasso under a heteroskedastic, ``Poisson-like" regression model. The performance of the Lasso is well understood under the assumptions of the standard linear model with homoscedastic noise. However, in several applications, the standard model does not describe the important features of the data. Chapter 4 examines how the Lasso performs on a non-standard model that is motivated by medical imaging applications. In these applications, the variance of the noise scales linearly with the expectation of the observation. Like all heteroscedastic models, the noise terms in this Poisson-like model are not independent of the design matrix.