eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Large-Scale Interpretable Multi-View Learning for Very High-Dimensional Problems with Application to Multi-Omic Data

Abstract

We discuss the sparse Canonical Correlation Analysis (CCA) problem in the context of high-dimensional multi-view problems, where we aim to discover interpretable association structures among multiple random vectors via their respective views, with an emphasis on settings where the number of observations is small compared to the number of covariates. Throughout this text, we use the term view, defined as the observations of a random vector on an ordered set of subjects, which is the same set for the observations of all other random vectors involved in the analysis. We denote each view by X_i ∈ R^(n×p_i), i = 1, ..., m, where n is the number of subjects, p_i is the number of covariates in the i-th view, and m is the number of random vectors, or equivalently the number of views. In the first two chapters we consider linear association structures shared among multiple views, where the objective is to learn sparse linear combinations of multiple sets of covariates such that they are maximally correlated. In the first chapter we introduce a new two-stage approach to sparse CCA. In the first stage we learn the sparsity patterns of the canonical directions by casting this problem as two successively shrinking concave minimization programs, which are solved via a first-order algorithm. In the second stage we solve a small CCA problem restricted to the sparsity patterns estimated in the first stage. We demonstrate via simulations that, in comparison to other available methods, our approach has superior convergence properties, better recovers the underlying sparsity patterns and the magnitudes of the non-zero elements of the canonical directions, and has significantly lower computational cost. We then apply our method to a multi-omic environmental genetics study on fruit flies, where we hypothesize about the mechanism by which this model organism adapts to environmental pesticides.
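As a rough illustration of the second stage only, the sketch below solves a small classical CCA problem restricted to given sparsity patterns. It is not the dissertation's implementation: the function name `restricted_cca`, the index arguments `support_x` and `support_y`, and the small `ridge` term added for numerical stability are all assumptions made for this example, and the first-stage concave programs that would produce the supports are not shown.

```python
import numpy as np

def restricted_cca(X, Y, support_x, support_y, ridge=1e-8):
    """First canonical pair of (X, Y), restricted to given supports.

    support_x / support_y are hypothetical index lists standing in for
    the sparsity patterns estimated in a first stage.
    """
    n = X.shape[0]
    # Keep only the selected covariates and center them.
    Xs = X[:, support_x] - X[:, support_x].mean(axis=0)
    Ys = Y[:, support_y] - Y[:, support_y].mean(axis=0)

    # Sample covariances; the ridge term is an assumption for stability.
    Sxx = Xs.T @ Xs / (n - 1) + ridge * np.eye(Xs.shape[1])
    Syy = Ys.T @ Ys / (n - 1) + ridge * np.eye(Ys.shape[1])
    Sxy = Xs.T @ Ys / (n - 1)

    # Whiten with Cholesky factors: Sxx = Lx Lx^T, Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, Sxy)        # Lx^{-1} Sxy
    M = np.linalg.solve(Ly, M.T).T      # Lx^{-1} Sxy Ly^{-T}

    # Leading singular triplet gives the first canonical correlation.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    a = np.linalg.solve(Lx.T, U[:, 0])  # direction on the X support
    b = np.linalg.solve(Ly.T, Vt[0])    # direction on the Y support

    # Embed back into the full covariate space; zeros off the support.
    u = np.zeros(X.shape[1])
    v = np.zeros(Y.shape[1])
    u[support_x] = a
    v[support_y] = b
    return u, v, s[0]
```

Because the restricted problem involves only the selected covariates, the covariance matrices stay small even when the original views have very many columns.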

In the second chapter we tackle a shared shortcoming of sparse PCA and sparse CCA methods: when multiple components or canonical directions are estimated for each view, these directions are not orthogonal to each other, which diminishes interpretability. While all other approaches estimate canonical directions one by one via the contraction scheme, we offer a block scheme in which we estimate the first d canonical directions simultaneously. In this setting, we can more easily impose orthogonality and also encourage disjoint sets of non-zero elements across the directions, resulting in more interpretable models. We also extend our model to what we call sparse Directed CCA, where we use an accessory variable, defined in the text, to try to capture variations related to a certain hypothesis, rather than the dominant variations, which may prove irrelevant to the main hypothesis. As a validating example, we apply our method to the lung cancer multi-omics data available on The Cancer Genome Atlas, using survival data as our accessory variable. While regular sparse CCA exclusively identified correlation structures dominated by gender, with communities separated by gender, our directed sparse CCA correctly identified two underlying communities that were significantly separated by survival.

In the final chapter, we generalize our framework to discover non-linear association structures by proposing a two-stage sparse kernel CCA (KCCA) algorithm. We learn maximally aligned kernels in the first stage via sparse Multiple Kernel Learning (MKL), and then solve a KCCA problem in the second stage using the learned kernels. We perform sparse MKL by forming an alignment matrix whose elements are the sample Hilbert-Schmidt Independence Criterion (HSIC) values of base kernels from pairs of views. These base kernels are functions of small sets of covariates of each view; our sparse MKL approach therefore provides interpretable solutions, as sparse convex linear combinations of base kernels. Finally, we provide an Apache Spark implementation of the methods introduced throughout the dissertation, which lets users run our methods on very high-dimensional datasets, e.g. observations on millions of Single Nucleotide Polymorphism loci, using distributed computing. We call this package SparKLe.
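To make the alignment matrix more concrete, here is a minimal sketch that computes the standard biased sample HSIC, trace(KHLH)/(n-1)^2 with H the centering matrix, for every pair of base kernels from two views. The Gaussian base kernel, its bandwidth, the covariate groupings `groups_x` and `groups_y`, and all function names are assumptions made for this example; the abstract does not specify which base kernels or groupings the dissertation uses, and the sparse MKL optimization over this matrix is not shown.

```python
import numpy as np

def gaussian_kernel(Z, bandwidth=1.0):
    """Gaussian (RBF) Gram matrix on the rows of Z (assumed base kernel)."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth ** 2))

def hsic(K, L):
    """Biased sample HSIC between two n x n Gram matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def alignment_matrix(X, Y, groups_x, groups_y):
    """HSIC between every base kernel of view X and every one of view Y.

    groups_x / groups_y are hypothetical lists of small column-index
    lists; each group defines one base kernel on its view.
    """
    Kx = [gaussian_kernel(X[:, g]) for g in groups_x]
    Ky = [gaussian_kernel(Y[:, g]) for g in groups_y]
    return np.array([[hsic(K, L) for L in Ky] for K in Kx])
```

Entry (j, k) of the returned matrix is large when the j-th covariate group of the first view and the k-th group of the second view are strongly dependent, which is the kind of signal a sparse kernel weighting could then select.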

R versions of our algorithms are also available. MuLe, BLOCCS, and SparKLe-R implement our methods presented in Chapters 1, 2, and 3, respectively.
