Scholarly Works (53 results)

Article
Peer Reviewed

Methods for detecting introgressed archaic sequences

Sankararaman, Sriram

UCLA Previously Published Works (2020)

Analyses of genome sequences from archaic and modern humans have revealed multiple episodes of admixture between highly diverged population groups. Statistical methods that attempt to localize DNA segments introduced by these events offer a powerful tool to investigate recent human evolution. We review recent advances in methods for detecting introgressed sequences.

Thesis
Peer Reviewed

Scalable and Robust Statistical Inference Algorithms for Linking Genotypes to Phenotypes

Pazokitoroudi, Ali
Advisor(s): Sankararaman, Sriram

UCLA Electronic Theses and Dissertations (2023)

With the advancements in DNA sequencing technology and the decreasing cost of sequencing, there has been exponential growth in the amount of genomic data generated. This growth provides an unprecedented opportunity to access the genotypes of a large population, including millions of genetic variants, and to collect hundreds of thousands of phenotypic measurements from the same individuals. This opens doors to systematically studying the genetic architecture underlying complex traits and diseases. Genetic architecture refers broadly to a complete understanding of all genetic contributions to a given trait as well as to an awareness of the characteristics of this contribution.

During the past decade, variance components analysis has emerged as a robust statistical framework for investigating the genetic architectures of complex traits. To gain accurate and innovative insights into genetic architecture, applying flexible variance component models to large-scale datasets is crucial. However, fitting such models necessitates the use of scalable algorithms. Common approaches for estimating variance components involve searching for parameter values that maximize the likelihood or the restricted maximum likelihood (REML). Despite several algorithmic advancements, computing REML estimates of variance components on extensive datasets like the UK Biobank, which consists of approximately 500,000 genotyped individuals, millions of single nucleotide polymorphisms (SNPs), and hundreds of thousands of phenotypes, remains challenging. This thesis introduces a set of scalable and robust statistical inference algorithms rooted in variance component analysis. These algorithms are designed to estimate the variation in a trait that can be explained by linear and non-linear functions of the genotype, such as the interaction between alleles at a single genetic variant (dominance), the interaction between genetic variants (epistasis), and the interaction between environmental factors and genetic variants (GxE). Furthermore, these algorithms aim to estimate the distribution of these effects across the genome.

By applying our methods to the UK Biobank dataset, we uncover valuable insights into the genetic architecture of complex traits. Notable observations are as follows. First, we observe that both per-allele squared additive and GxE effect sizes increase with decreasing minor allele frequency (MAF) and linkage disequilibrium (LD). Second, testing whether GxE heritability is enriched around genes that are highly expressed in specific tissues, we find significant tissue-specific enrichments that include brain-specific enrichment for BMI and Basal Metabolic Rate in the context of smoking, adipose-specific enrichment for WHR in the context of sex, and cardiovascular tissue-specific enrichment for total cholesterol in the context of age. Third, we detect epistasis effects between SNPs located on the same chromosome and between SNPs located on different chromosomes. Fourth, our analyses indicate a limited contribution of dominance heritability to complex trait variation.

Thesis
Peer Reviewed

Leveraging genetic and electronic health record data to understand complex traits and rare diseases

Johnson, Ruth Dolly
Advisor(s): Sankararaman, Sriram

UCLA Electronic Theses and Dissertations (2023)

The biobank era of genomics has ushered in a multitude of opportunities for precision medicine research. In particular, biobanks connected to electronic health records (EHR) provide rich phenotype information used to study the clinical phenome. First, I describe two computational methods designed to infer the genetic architecture of complex traits using biobank-scale data. Both methods are based on Markov Chain Monte Carlo techniques. Next, I provide an overview of the UCLA ATLAS Community Health Initiative (ATLAS), an EHR-linked biobank embedded within UCLA Health. Using this data set, I explore the role of genetic ancestry in common disease risk across the UCLA patient population. Next, I include a review of how race, ethnicity, and genetic ancestry are utilized in the field of EHR-linked biobanks. Finally, I propose an EHR-based algorithm, called PheNet, which identifies undiagnosed patients with Common Variable Immunodeficiency Disorders, and demonstrate its application across a total of 5 University of California Health systems.

Thesis
Peer Reviewed

Methods for detecting structure in large-scale genomic data

Chiu, Alec Matthew
Advisor(s): Sankararaman, Sriram

UCLA Electronic Theses and Dissertations (2022)

Large-scale repositories of genomic data are providing opportunities for researchers to answer biological questions at unprecedented resolution. Uncovering the structure underlying these datasets is a fundamental task where the structure can correspond to biological signals of interest or to confounders such as ancestry and batch effects that must be accounted for to prevent spurious findings. While discovering structure is a challenging problem, the growing size of genomic datasets leads to computational bottlenecks that further complicate their analysis. Here, we propose three scalable approaches for detecting structure in genomic data. We present ProPCA, a probabilistic principal component analysis method for large-scale genomic data. We also introduce SCOPE, a method for inferring admixture proportions from biobank-scale data. Both these methods utilize randomized eigendecomposition and the unique structure of the genotype matrix to perform scalable population structure inference. We apply these methods to simulations to reveal that they remain accurate while improving on runtime compared to existing methods. We applied both methods to the UK Biobank, a dataset containing half a million individuals, to uncover fine-scale structure within the United Kingdom. We subsequently introduce a statistical testing framework for detecting variance and covariance differences by extending eigengene analysis through a set of transformations and randomized eigendecomposition. We use RNA-seq data from individuals with psychiatric disease to reveal several (co)variance differences, highlighting the need to look beyond mean effects. With the increasing availability of large biological datasets, our work enables researchers to efficiently discover and test for structure and perform downstream analyses.

Thesis
Peer Reviewed

Initializing Hard-Label Black-Box Adversarial Attacks Using Known Perturbations

Mathur, Shaan Karan
Advisor(s): Sankararaman, Sriram

UCLA Electronic Theses and Dissertations (2021)

We empirically show that an adversarial perturbation for one image can be used to accelerate attacks on another image. Specifically, we show how to improve the initialization of the hard-label black-box attack Sign-OPT, operating in the most challenging attack setting, by using previously known adversarial perturbations. Whereas Sign-OPT initializes its attack by searching along random directions for the nearest boundary point, we search for the nearest boundary point along the direction of previously known perturbations. This initialization strategy leads to a significant drop in initial distortion in both the MNIST and CIFAR-10 datasets. Identifying the similar vulnerability of images is a promising direction for future research.

Thesis
Peer Reviewed

Statistical models for analyzing human genetic variation

Sankararaman, Sriram
Advisor(s): Jordan, Michael I

UC Berkeley Electronic Theses and Dissertations (2010)

Advances in sequencing and genomic technologies are providing new opportunities to understand the genetic basis of phenotypes such as diseases. Translating the large volumes of heterogeneous, often noisy, data into biological insights presents challenging problems of statistical inference. In this thesis, we focus on three important statistical problems that arise in our efforts to understand the genetic basis of phenotypic variation in humans.

At the molecular level, we focus on the problem of identifying the amino acid residues in a protein that are important for its function. Identifying functional residues is essential to understanding the effect of genetic variation on protein function as well as to understanding protein function itself. We propose computational methods that predict functional residues using evolutionary information alone as well as a combination of evolutionary and structural information. We demonstrate that these methods can accurately predict catalytic residues in enzymes. Case studies on well-studied enzymes show that these methods can be useful in guiding future experiments.

At the population level, discovering the link between genetic and phenotypic variation requires an understanding of the genetic structure of human populations. A common form of population structure is that found in admixed groups formed by the intermixing of several ancestral populations, such as African-Americans and Latinos. We describe a Bayesian hidden Markov model of admixture and propose efficient algorithms to infer the fine-scale structure of admixed populations. We show that the fine-scale structure of these populations can be inferred even when the ancestral populations are unknown or extinct. Further, the inference algorithm can run efficiently on genome-scale datasets. This model is well-suited to estimate other parameters of biological interest such as the allele frequencies of ancestral populations which can be used, in turn, to reconstruct extinct populations.

Finally, we address the problem of sharing genomic data while preserving the privacy of individual participants. We analyze the problem of detecting an individual genotype from the summary statistics of single nucleotide polymorphisms (SNPs) released in a study. We derive upper bounds on the power of detection as a function of the study size, number of exposed SNPs and the false positive rate, thereby providing guidelines as to which set of SNPs can be safely exposed.

Article
Peer Reviewed

INTREPID: a web server for prediction of functionally important residues by evolutionary analysis

UC Berkeley Previously Published Works (2009)

We present the INTREPID web server for predicting functionally important residues in proteins. INTREPID has been shown to boost the recall and precision of catalytic residue prediction over other sequence-based methods and can be used to identify other types of functional residues. The web server takes an input protein sequence, gathers homologs, constructs a multiple sequence alignment and phylogenetic tree and finally runs the INTREPID method to assign a score to each position. Residues predicted to be functionally important are displayed on homologous 3D structures (where available), highlighting spatial patterns of conservation at various significance thresholds. The INTREPID web server is available at http://phylogenomics.berkeley.edu/intrepid.

Thesis
Peer Reviewed

Efficient methods for Understanding the Genetic Architecture of Complex Traits

UCLA Electronic Theses and Dissertations (2022)

Understanding the genetic architecture of complex traits is a central goal of modern human genetics. Recent efforts focused on building large-scale biobanks that collect genetic and trait data on large numbers of individuals present exciting opportunities for understanding genetic architecture. However, these datasets also pose several statistical and computational challenges. In this dissertation, we consider a series of statistical models that allow us to infer aspects of the genetic architecture of single and multiple traits. Inference in these models is computationally challenging due to the size of the genetic data, which consists of millions of genetic variants measured across hundreds of thousands of individuals. We propose a series of scalable computational methods that can perform efficient inference in these models and apply these methods to data from the UK Biobank to showcase their utility.

Article
Peer Reviewed

Evidence for archaic adaptive introgression in humans.

UC Berkeley Previously Published Works (2015)

As modern and ancient DNA sequence data from diverse human populations accumulate, evidence is increasing in support of the existence of beneficial variants acquired from archaic humans that may have accelerated adaptation and improved survival in new environments, a process known as adaptive introgression. Within the past few years, a series of studies have identified genomic regions that show strong evidence for archaic adaptive introgression. Here, we provide an overview of the statistical methods developed to identify archaic introgressed fragments in the genome sequences of modern humans and to determine whether positive selection has acted on these fragments. We review recently reported examples of adaptive introgression, grouped by selection pressure, and consider the level of supporting evidence for each. Finally, we discuss challenges and recommendations for inferring selection on introgressed regions.

Creative Commons 'BY-NC-ND' version 4.0 license

Article
Peer Reviewed

Inference of locus-specific ancestry in closely related populations

UCLA Previously Published Works (2009)

A characterization of the genetic variation of recently admixed populations may reveal historical population events, and is useful for the detection of single nucleotide polymorphisms (SNPs) associated with diseases through association studies and admixture mapping. Inference of locus-specific ancestry is key to our understanding of the genetic variation of such populations. While a number of methods for the inference of locus-specific ancestry are accurate when the ancestral populations are quite distant (e.g. African-Americans), current methods incur a large error rate when inferring the locus-specific ancestry in admixed populations where the ancestral populations are closely related (e.g. Americans of European descent).

Results

In this work, we extend previous methods for the inference of locus-specific ancestry by the incorporation of a refined model of recombination events. We present an efficient dynamic programming algorithm to infer the locus-specific ancestries in this model, resulting in a method that attains improved accuracies; the improvement is most significant when the ancestral populations are closely related. An evaluation on a wide range of scenarios, including admixtures of the 52 population groups from the Human Genome Diversity Project, demonstrates that locus-specific ancestry can indeed be accurately inferred in these admixtures using our method. Finally, we demonstrate that imputation methods can be improved by the incorporation of locus-specific ancestry when applied to admixed populations.

Availability

The implementation of the WINPOP model is available as part of the LAMP package at http://lamp.icsi.berkeley.edu/lamp.
