Bayesian networks are a popular class of graphical models that encode conditional independence and causal relations among variables via directed acyclic graphs (DAGs). In this thesis, we develop algorithms to estimate Bayesian network structures. We propose two structure learning methods, both of which minimize regularized negative log-likelihood functions over the space of orderings.
First, we propose the annealing on regularized Cholesky score (ARCS) algorithm to learn Gaussian Bayesian networks. The ARCS scoring function is derived by regularizing the Gaussian DAG likelihood, and its optimization amounts to a sparse Cholesky decomposition that depends on the choice of a permutation (matrix) $P$. For this reason, we call our objective function the regularized Cholesky (RC) score of a permutation. Minimizing the RC score is thus a joint optimization over a permutation $P$ and a lower triangular matrix $L$: given a permutation, the acyclicity constraint of DAGs translates into strict lower triangularity of $L$. ARCS uses simulated annealing to search over the permutation space and an effective first-order method, the proximal gradient algorithm, to compute the optimal DAG compatible with $P$. Combined, the two components allow us to search the space of DAGs quickly and effectively, without verifying the acyclicity constraint or enumerating possible parent sets for a candidate topological sort. The annealing component consistently improves the accuracy of DAGs learned by greedy and deterministic search algorithms. In extensive numerical tests, ARCS achieved high structure learning accuracy and outperformed existing methods by a substantial margin when learning Gaussian DAGs from observational and experimental data. As a byproduct, ARCS accurately estimates the Gaussian covariance matrix, achieving higher test data likelihood than other covariance estimation methods. On the theoretical side, we establish the consistency of the RC score for estimating topological sorts and DAG structures in the large-sample limit.
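The inner step of this scheme, for a fixed permutation, can be sketched as a proximal gradient iteration on a Cholesky-parametrized Gaussian loss. The following is a minimal illustration, not the thesis's exact RC score: it uses a standard log-det loss $\mathrm{tr}(S L L^\top) - 2\sum_i \log L_{ii}$ with an $\ell_1$ penalty on off-diagonal entries, and the penalty weight, step size, and positivity safeguard are all illustrative assumptions.

```python
import numpy as np

def rc_score(L, S, lam):
    """Illustrative stand-in for the RC score at a fixed permutation:
    Gaussian log-det loss with precision Omega = L L^T (L lower triangular),
    plus an l1 penalty on the strictly lower-triangular entries."""
    smooth = np.trace(S @ L @ L.T) - 2.0 * np.sum(np.log(np.diag(L)))
    return smooth + lam * np.sum(np.abs(np.tril(L, -1)))

def prox_grad_step(L, S, lam, eta):
    """One proximal gradient step: a gradient step on the smooth part,
    then soft-thresholding (the l1 prox) of the strictly lower entries."""
    grad = 2.0 * S @ L - 2.0 * np.diag(1.0 / np.diag(L))
    Lnew = L - eta * grad
    off = np.tril(Lnew, -1)
    off = np.sign(off) * np.maximum(np.abs(off) - eta * lam, 0.0)
    diag = np.clip(np.diag(Lnew), 1e-3, None)  # safeguard: keep diagonal positive
    return np.diag(diag) + off                 # result stays lower triangular

def fit_cholesky(S, lam=0.1, eta=0.01, iters=200):
    """Minimize the illustrative RC score over lower-triangular L for a
    fixed permutation, i.e. with S already permuted by P."""
    p = S.shape[0]
    L = np.linalg.cholesky(np.linalg.inv(S + 1e-6 * np.eye(p)))
    for _ in range(iters):
        L = prox_grad_step(L, S, lam, eta)
    return L
```

In the full algorithm, an outer simulated annealing loop would propose permutation moves (e.g. swaps of two positions) and accept or reject them based on the change in this inner minimum.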
The second method we propose is the distributed annealing on regularized likelihood score (DARLS) algorithm, which generalizes ARCS to learn a flexible family of DAGs from distributed data. To the best of our knowledge, it is the first method that uses distributed optimization to learn causal structures from data stored on different machines. DARLS searches over the space of topological sorts with a simulated annealing strategy for a high-scoring causal graph, where the optimal graphical structure compatible with a sort is found by a distributed optimization method. We show that the sequence of estimates generated by this distributed optimization converges to a global optimizer of the overall score computed on all data across the local machines. Additionally, we propose generalized linear DAG models, in which the conditional distributions of a Bayesian network are given by generalized linear models (GLMs) with canonical links. GLMs form a flexible family of distributions that accommodate various types of data, which greatly increases the applicability of our DAG models. In simulation studies, DARLS operating on distributed data demonstrated performance competitive with existing methods that use the pooled data across local machines. It also exhibited higher predictive power than other methods in a real-world application modeling protein-DNA binding networks with ChIP-sequencing data.
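DARLS's actual distributed optimization is more involved than plain gradient descent, but the key principle can be illustrated simply: for a GLM with canonical link, the full-data gradient of the negative log-likelihood is the sum of per-machine gradients, so each descent step can use all the data without moving it off the local machines. The sketch below (with hypothetical function names) fits a single node's conditional as a logistic GLM from sharded data under this assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_gradient(X, y, beta):
    """Gradient of the logistic negative log-likelihood on one machine's
    shard (logistic regression is a GLM with the canonical logit link)."""
    return X.T @ (sigmoid(X @ beta) - y)

def distributed_fit(shards, p, eta=0.5, iters=300):
    """Gradient descent for one node's GLM conditional: each machine
    computes its local gradient, and summing them gives the exact
    full-data gradient, so raw data never leaves the local machines."""
    n_total = sum(len(y) for _, y in shards)
    beta = np.zeros(p)
    for _ in range(iters):
        full_grad = sum(local_gradient(X, y, beta) for X, y in shards)
        beta -= eta * full_grad / n_total
    return beta
```

In the structure learning setting, one such regularized fit would be run per node, with the candidate parent set determined by the current topological sort proposed by the annealing search.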