Stochastic Gradient MCMC: Algorithms and Applications
Despite the powerful advantages of Bayesian inference such as quantifying uncertainty, ac- curate averaged prediction, and preventing overfitting, the traditional Markov chain Monte Carlo (MCMC) method has been regarded unsuitable for large-scale problems because it required processing the entire dataset per iteration rather than using a small random mini- batch as performed in the stochastic gradient optimization. The first attempt toward the scalable MCMC method based on stochastic gradients is the stochastic gradient Langevin dynamics (SGLD) proposed by Welling and Teh . Originated from the Langevin Monte Carlo method, SGLD achieved O(n) computation per iteration (here, n is the size of a minibatch) by using stochastic gradients estimated using minibatches and skipping the Metropolis-Hastings accept-reject test.
In this thesis, we introduce recent advances in the stochastic gradient MCMC method since the advent of SGLD. Our contributions are two-fold: algorithms and applications. In the algorithm part, we first propose the stochastic gradient Fisher scoring algorithm (SGFS) which resolves two drawbacks of SGLD: the poor mixing rate and the arbitrarily large bias occurred when using large step sizes. Then, we also propose the distributed SGLD (D-SGLD) algorithm which makes it possible to extend the power of stochastic gradient MCMC to the distributed computing systems. In the second part, we apply the developed SG-MCMC algorithms to the most popular large-scale problems: the topic modeling using the latent Dirichlet allocation model, recommender systems using matrix factorization, and community modeling in social networks using mixed membership stochastic blockmodels. By resolving the unique challenges raised by each of the applications, which make it difficult to directly use the existing SG-MCMC methods, we obtain the-state-of-the-art results outperforming existing approaches using collapsed Gibbs sampling, stochastic variational inference, or dis- tributed stochastic gradient descent.