Advances in Exponential-family Random Graph Models: Computation, Model Selection, and Methodology
- Author(s): Yin, Fan
- Advisor(s): Butts, Carter T
- et al.
Networks (graphs) are broadly used to represent relations between entities in a wide range of scientific fields. Exponential-family random graph models (ERGMs) provide a highly general way of specifying distributions on graphs, allowing the complex dependence structure of edges in a network to be specified in terms of local structural properties. This thesis addresses problems related to three lines of inquiry for ERGMs: faster Bayesian inference algorithms; comparison of newly proposed and traditional model selection techniques; and methodological innovation for modeling ensembles of networks.
In Chapter 2 of this dissertation, we present a highly parallel algorithm that enables fast Bayesian inference on ERGMs. The impetus for this work comes from the facts that conducting Bayesian inference for ERGMs is challenging because of the intractability of both the likelihood and posterior normalizing factor and auxiliary-variable based Markov Chain Monte Carlo (MCMC) methods for this problem are asymptotically exact but computationally demanding. We propose a kernel-based approximate Bayesian computation algorithm for fitting ERGMs, which is easily parallelizable. Through empirical comparisons against the state-of-the-art approximate exchange algorithm, we show that the proposed algorithm yields comparable accuracy to the state-of-the-art MCMC approach, the approximate exchange algorithm (Caimo and Friel, 2011), while cutting the wallclock runtime by half with 5 cores, and by 80\% with 30 cores.
In Chapter 3 of this dissertation, we carry out simulation studies to compare newly proposed and traditional model selection techniques. This work is driven by the importance of understanding the strengths and weaknesses of those model selection techniques for ERGMs that are currently available, including Akaike information criterion (Akaike, 1973), Bayesian information criterion (Schwarz, 1978), Held-Out Predictive Evaluation (HOPE) (Yin et al., 2019), Bayes factors (Raftery, 1995) and graphical goodness of fit (Hunter et al., 2008). In particular, we focus on the first three techniques, as the calculation of Bayes factor for ERGMs relies on reversible jump Markov chain Monte Carlo algorithm extension of the approximate exchange algorithm (Caimo and Friel, 2013), which is hard to implement and tune; the graphical goodness of fit is more suitable for checking whether a model is adequate rather than comparing competing models. The simulation studies are carried out under two scenarios, closed-M (under which the true model is among the set of candidate models) and open-M (under which the true model is not among the set of candidate models), and we evaluate the performance of model selection techniques from various aspects covering the model selection accuracy, predictive deviance and prediction accuracy of edge variables.
In Chapter 4 of this dissertation, we propose a novel methodology that can be used for modeling the generative processes of ensembles of networks. The motivation of this work is that ensembles of networks arise in many scientific fields, but there are few statistical tools for inferring their generative processes, particularly in the presence of both dyadic dependence and cross-graph heterogeneity. To fill in this gap, we propose characterizing network ensembles via finite mixtures of exponential family random graph models, a framework for parametric statistical modeling of graphs that has been successful in explicitly modeling the complex stochastic processes that govern the structure of edges in a network. Our proposed methodology can also be used for applications such as model-based clustering of ensembles of networks and density estimation for complex graph distributions. We develop a Metropolis-within-Gibbs algorithm to conduct fully Bayesian inference and adapt a version of deviance information criterion for missing data models to choose the number of latent heterogeneous generative mechanisms. Simulation studies show that the proposed procedure can recover the true number of latent heterogeneous generative processes and corresponding parameters. We demonstrate the utility of the proposed approach using an ensemble of political co-voting networks among U.S. Senators and an ensemble of advice-seeking networks among school teachers.