Scholarly Works (60 results)

Thesis
Peer Reviewed

Browsing in the Library of Babel: Leveraging Evolutionary Information to Improve Protein Modeling

Thomas, Neil
Advisor(s): Song, Yun S

UC Berkeley Electronic Theses and Dissertations (2022)

Proteins are the molecular machines that perform the vast majority of natural biological functions. Discovering proteins to perform novel functions or optimizing them for an existing function are central goals of synthetic biology. Doing so is challenging primarily because for most proteins there is limited understanding of how they function, let alone how to modify them; experimental characterization and crystal structures are expensive and time-consuming to collect. For a given protein, however, genes performing related functions can be found in the genomes of diverse organisms, the natural result of the process of evolution. With improved techniques for genetic sequencing, an abundance of data deposited in protein sequence databases has become available. This presents a tantalizing modeling opportunity: models that can understand protein function through the observation of related sequences can reduce the reliance on experimental characterization and unlock new possibilities for protein discovery and optimization. Building such models has been a goal of bioinformatics research, and has more recently emerged as a goal of machine learning research. In particular, "protein language models," models trained to learn a distribution over sequence data, have shown promise in predicting functional properties of proteins.

This work leverages the information in protein sequence databases to the following ends. First, it presents a benchmark for the effectiveness of protein language models using a suite of protein prediction tasks. Second, it draws a connection between a well-established graphical model of protein families and the neural network architecture of protein language models. Third, it presents a framework for deriving synthetic protein fitness landscapes from evolutionary data that can be used to evaluate strategies for model-guided protein design in silico.

Thesis
Peer Reviewed

Statistical Methods for Genome Assembly

Bresler, Maayan
Advisor(s): Song, Yun S

UC Berkeley Electronic Theses and Dissertations (2014)

In the last decade, sequencing technology has progressed rapidly, leading to much faster and cheaper production of short-read data. The challenge of assembling the reads into an accurate reconstruction of the sequenced genome, however, has increased. This is because the assembly problem is made more difficult when the reads are shorter, especially for genomes of most higher organisms, which contain complicated repeat structures. In this thesis we study the algorithmic problem of de novo DNA sequence assembly, focusing on the challenge of dealing with genomic repeats. We develop two new assembly tools, as well as initiate the study of information-theoretic limits of shotgun sequencing for realistic genomes.

Our first novel algorithm for DNA assembly, Telescoper, is designed for assembly of telomeres. Due to their many repeats, telomeric regions are notoriously difficult to assemble. Telescoper iteratively extends long paths through a series of read-overlap graphs and evaluates them based on a statistical framework. The algorithm utilizes both short and long-insert libraries in an integrated way throughout the assembly process. This approach is shown to effectively resolve some of the complex repeat structures found in the telomeres of yeast genomes.

Our second novel algorithm for DNA assembly, Piper, takes a statistical approach to resolving ambiguity caused by repeats. A lot of potentially useful information is present in paired-end reads, but due to the inherent noise in the insert length and the combinatorial nature of the problem, it is not clear how to best use this information. Piper selects a set of candidate paths through the contig-graph, and scores them based on their likelihood given a generative model for the reads. The output consists of a ranked set of assemblies (rather than a single assembly) in order to give the maximum information available, while still explicitly encoding unresolved ambiguity. On small simulated datasets, Piper produces excellent error-free assemblies.

In the final portion of the thesis, we investigate the information-theoretic limits of DNA sequencing, focusing on the effect of repeats. Specifically, we ask: how many reads of a given length are necessary in order to perfectly reconstruct the genome with a certain target probability? We focus on a simple read model, with noiseless single-end reads, but consider arbitrary genomic sequences. We first prove a lower bound on the read length and the coverage depth required for reconstruction in terms of the repeat statistics of the genome. Building on known algorithms, we design a de Bruijn graph-based assembly algorithm which can achieve very close to the lower bound for repeat statistics of a wide range of sequenced genomes. The results are based on a set of necessary and sufficient conditions on the DNA sequence and the reads for reconstruction.
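
For readers unfamiliar with de Bruijn graphs, the following is a minimal Python sketch of the basic construction only; it is not the assembly algorithm developed in the thesis. Nodes are (k-1)-mers and each k-mer observed in a read contributes one edge. The function name, the toy reads, and the choice k = 4 are all illustrative assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the sorted list of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return {node: sorted(successors) for node, successors in graph.items()}


# Hypothetical toy reads; the 3-mer "ACG" occurs twice with different continuations.
reads = ["ACGTAC", "GTACGA"]
graph = de_bruijn_graph(reads, k=4)
for node, successors in sorted(graph.items()):
    print(node, "->", successors)
```

In this toy input the node ACG has two outgoing edges, which is the kind of branching that a repeated substring introduces and that an assembler must resolve.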

Thesis
Peer Reviewed

Computational Tools for Immune Repertoire Characterization and Primer Set Design

Yu, Jane
Advisor(s): Song, Yun S

UC Berkeley Electronic Theses and Dissertations (2019)

The enormous decrease in the cost of genomic sequencing over the past two decades has enabled researchers to revisit previously unaddressable questions in sequence analysis. However, this boom of genomic information has introduced new sets of problems that often demand computationally efficient methods. In this work, we describe computational tools for two such settings involving large-scale genomic data: 1) estimating copy number and allelic variation in two highly complex gene families, and 2) selective sequencing of a target genome in a complex DNA sample.

We first describe a method that takes short reads from high-throughput sequencing and characterizes both copy number and allelic variation in the IGHV and TRBV loci. These two loci can vary extensively between individuals in copy number and contain genes that are highly similar, making their analysis technically challenging. Additionally, we have conducted the first study of a globally diverse sample of hundreds of individuals in these two loci from over a hundred populations. In addition to providing insight into the different evolutionary paths of the IGHV and TRBV loci, our results are also important to the adaptive immune repertoire sequencing community, where the lack of frequencies of common alleles and copy number variants is hampering existing analytical pipelines.

In our second problem setting, we describe SOAPswga, an optimized and parallelized pipeline for primer design in the context of selective amplification. Unlike previous heuristic-based methods, SOAPswga uses machine learning methods to evaluate both individual primers and primer sets. Additionally, rather than searching for primer sets by brute force, as predecessor methods do, SOAPswga uses branch-and-bound principles to pursue only the most promising sets. These optimizations, including the parallelization of each step, reduce the runtime from the order of weeks to minutes. We also discuss the results of our pipeline applied to the selective amplification of Mycobacterium tuberculosis in a sample of human blood. Lastly, we expand on the importance of this work and, more generally, its potential usefulness in any setting that involves targeted sequencing.

Thesis
Peer Reviewed

Scalable Algorithms for Population Genomic Inference

Sheehan, Sara
Advisor(s): Song, Yun S

UC Berkeley Electronic Theses and Dissertations (2015)

Since the 1920s, researchers in population genetics have developed mathematical models to explain how a species evolves. With the rise of DNA sequencing over the past decade, we now have the data to use these models to answer real questions in evolutionary biology. However, the sheer amount of data and the time complexity of the models make inference extremely challenging. Computer science has therefore become an essential tool for bridging theoretical models and modern sequencing data.

In this thesis we present two novel algorithms that make use of DNA sequencing data in a principled yet practical way. The first method estimates the history of effective population sizes of a species using a coalescent hidden Markov model (HMM). Previous coalescent HMMs could only handle a few sequences, since the set of coalescent trees makes the state space prohibitively large. Our algorithm uses a modified state space to make inference computationally feasible while still retaining the essential genealogical features of a sample. We apply this algorithm, called diCal, to human data to learn more about major events in human history, such as the out-of-Africa migration. We also provide several extensions to diCal that make the computation faster, more automated, and applicable in a wider variety of scenarios.

The second method is an algorithm for jointly estimating effective population size changes and natural selection. These two factors can leave similar traces in genomic data, and the models that would describe both are computationally intractable. Our method uses a machine learning technique called deep learning to make the inference procedure robust and efficient. Deep learning automatically teases out important features of the data, but previously had not been used in population genetics. We apply this method to African Drosophila melanogaster data to jointly infer their population size changes and classify each region of their genome as neutral or under natural selection. We considered three types of selection: hard sweeps, soft sweeps, and balancing selection. To create a sophisticated framework for population genomic inference, in the future it would be promising to combine machine learning algorithms with biologically-inspired coalescent modeling.

Thesis
Peer Reviewed

Statistical, algorithmic, and robustness aspects of population demographic inference from genomic variation data

Bhaskar, Anand
Advisor(s): Song, Yun S

UC Berkeley Electronic Theses and Dissertations (2013)

The recent availability of large-sample high-throughput sequencing data has given us an unprecedented opportunity to very finely resolve the details of historical demographic processes that have shaped the genomes of modern human populations. Such understanding of population demography is important for several applications — to avoid false positives in genome-wide association studies; to calibrate null models of neutral genome evolution in order to find regions under selection; to study the impact of bottlenecks and small founder populations on genetic mutational load; to reconstruct large-scale historical human migration and admixture events; and so on.

In this dissertation, we consider some statistical, algorithmic and robustness aspects of demographic inference from genomic variation data. In particular, we study the problem of determining the historical effective size of a population from the sample frequency spectrum (SFS), which measures the distribution of allele frequencies in a sample of sequences drawn from the population.
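
To make the definition of the SFS concrete, here is a small Python sketch that tabulates it from a toy sample. It assumes haplotypes are encoded as equal-length lists of 0/1 values with 1 marking the derived allele; the function name and the data are hypothetical and are not taken from the dissertation.

```python
def sample_frequency_spectrum(haplotypes):
    """Return [xi_1, ..., xi_{n-1}], where xi_i counts sites with i derived alleles.

    haplotypes: list of n equal-length lists of 0/1 values (1 = derived allele).
    Sites where the derived allele is absent or fixed in the sample are ignored.
    """
    n = len(haplotypes)
    num_sites = len(haplotypes[0])
    sfs = [0] * (n - 1)
    for site in range(num_sites):
        derived_count = sum(h[site] for h in haplotypes)
        if 0 < derived_count < n:
            sfs[derived_count - 1] += 1
    return sfs


# Hypothetical sample of n = 4 haplotypes typed at 5 sites.
haplotypes = [
    [0, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
sfs = sample_frequency_spectrum(haplotypes)
print(sfs)  # two singleton sites, one doubleton site, no tripleton sites
```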

From the statistical or information-theoretic perspective, it is known that this inverse problem does not have a unique solution in general, no matter how large the sample size. For any population allele frequency distribution, there exist infinitely many population size functions that are consistent with this distribution. While such a non-identifiability result might appear to pose a serious problem to statistical inference algorithms, we show that the situation is not so bad in practice. In particular, we prove that if the true population size function is piecewise-defined with each piece belonging to some family of biologically-motivated functions, then the SFS of a finite sample of sequences uniquely determines the underlying demography. We obtain a general bound on the sample size sufficient for identifiability; this bound depends on the number of pieces in the demographic model and on the family of functions for each piece. We also give concrete instantiations of this bound for piecewise-constant and piecewise-exponential models that are commonly used in demographic inference analyses.

From the algorithmic perspective, we build on analytic results for the expected SFS of a time-varying population size function and develop an efficient likelihood-based algorithm to infer piecewise-exponential population size histories from large sample allele frequency data. By considering very large samples, our method can resolve details of the population history from the very recent past that are not otherwise accessible using smaller samples.

The third aspect of this dissertation is concerned with understanding the robustness of widely used evolutionary models to violations of model assumptions. Continuous-time evolutionary models like Kingman's coalescent and its dual diffusion process are derived from discrete models of random mating by assuming that the sample size being analyzed is much smaller than the population size. However, the very large sample datasets being produced due to advances in high-throughput sequencing technologies are approaching the limits of this assumption. To investigate this issue, we develop exact algorithms for computation under the discrete-time Wright-Fisher model and use these algorithms to study the distortions in several genealogical quantities arising due to the coalescent approximation. Our findings indicate that for several demographic models inferred from large-scale sequence data, there can be substantial genealogical deviations introduced by the coalescent approximation that might influence the results of inference studies.

Article
Peer Reviewed

Multi-locus match probability in a finite population: a fundamental difference between the Moran and Wright–Fisher models

UC Berkeley Previously Published Works (2009)

Motivation

A fundamental problem in population genetics, which is also of importance to forensic science, is to compute the match probability (MP) that two individuals randomly chosen from a population have identical alleles at a collection of loci. At present, 11-13 unlinked autosomal microsatellite loci are typed for forensic use. In a finite population, the genealogical relationships of individuals can create statistical non-independence of alleles at unlinked loci. However, the so-called product rule, which is used in courts in the USA, computes the MP for multiple unlinked loci by assuming statistical independence, multiplying the one-locus MPs at those loci. Analytically testing the accuracy of the product rule for more than five loci has hitherto remained an open problem.
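
The product rule described above can be sketched in a few lines of Python. This is an illustration only, not the article's graphical framework: it assumes the one-locus haplotypic MP is the sum of squared allele frequencies (the chance that two independently drawn haplotypes carry the same allele), and the allele frequencies below are made up.

```python
import math


def one_locus_match_probability(allele_frequencies):
    """Chance that two independently drawn haplotypes carry the same allele."""
    return sum(p * p for p in allele_frequencies)


def product_rule_match_probability(loci):
    """Multiply one-locus MPs, i.e. treat the loci as statistically independent."""
    return math.prod(one_locus_match_probability(freqs) for freqs in loci)


# Hypothetical allele frequencies at two unlinked loci.
loci = [
    [0.5, 0.3, 0.2],
    [0.6, 0.4],
]
mp = product_rule_match_probability(loci)
print(round(mp, 4))
```

The independence assumption is exactly what the article questions for finite populations, so this value should be read as the product-rule estimate rather than the true multi-locus MP.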

Results

In this article, we adopt a flexible graphical framework to compute multi-locus MPs analytically. We consider two standard models of random mating, namely the Wright-Fisher (WF) and Moran models. We succeed in computing haplotypic MPs for up to 10 loci in the WF model, and up to 13 loci in the Moran model. For a finite population and a large number of loci, we show that the MPs predicted by the product rule are highly sensitive to mutation rates in the range of interest, while the true MPs computed using our graphical framework are not. Furthermore, we show that the WF and Moran models may produce drastically different MPs for a finite population, and that this difference grows with the number of loci and mutation rates. Although the two models converge to the same coalescent or diffusion limit, in which the population size approaches infinity, we demonstrate that, when multiple loci are considered, the rate of convergence in the Moran model is significantly slower than that in the WF model.

Availability

A C++ implementation of the algorithms discussed in this article is available at http://www.cs.berkeley.edu/~yss/software.html.

Article
Peer Reviewed

Descartes’ rule of signs and the identifiability of population demographic models from genomic variation data

UC Berkeley Previously Published Works (2014)

The sample frequency spectrum (SFS) is a widely-used summary statistic of genomic variation in a sample of homologous DNA sequences. It provides a highly efficient dimensional reduction of large-scale population genomic data and its mathematical dependence on the underlying population demography is well understood, thus enabling the development of efficient inference algorithms. However, it has been recently shown that very different population demographies can actually generate the same SFS for arbitrarily large sample sizes. Although in principle this nonidentifiability issue poses a thorny challenge to statistical inference, the population size functions involved in the counterexamples are arguably not so biologically realistic. Here, we revisit this problem and examine the identifiability of demographic models under the restriction that the population sizes are piecewise-defined where each piece belongs to some family of biologically-motivated functions. Under this assumption, we prove that the expected SFS of a sample uniquely determines the underlying demographic model, provided that the sample is sufficiently large. We obtain a general bound on the sample size sufficient for identifiability; the bound depends on the number of pieces in the demographic model and also on the type of population size function in each piece. In the cases of piecewise-constant, piecewise-exponential and piecewise-generalized-exponential models, which are often assumed in population genomic inferences, we provide explicit formulas for the bounds as simple functions of the number of pieces. Lastly, we obtain analogous results for the "folded" SFS, which is often used when there is ambiguity as to which allelic type is ancestral. Our results are proved using a generalization of Descartes' rule of signs for polynomials to the Laplace transform of piecewise continuous functions.
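
As a small illustration of folding, the Python sketch below converts an unfolded SFS into a folded one by merging the entries for derived-allele counts i and n - i, since those two cases cannot be told apart when the ancestral allele is unknown. The list layout (entry i - 1 holds the count for i derived alleles) and the numbers are assumptions chosen for the example, not taken from the article.

```python
def fold_sfs(sfs):
    """Fold an unfolded SFS [xi_1, ..., xi_{n-1}] into minor-allele counts.

    Entry j - 1 of the result counts sites whose minor allele appears j times,
    for j = 1, ..., floor(n / 2).
    """
    n = len(sfs) + 1  # sample size implied by the length of the spectrum
    folded = []
    for j in range(1, n // 2 + 1):
        if j == n - j:
            # Middle class for even n: the two alleles are equally frequent.
            folded.append(sfs[j - 1])
        else:
            folded.append(sfs[j - 1] + sfs[n - j - 1])
    return folded


# Hypothetical unfolded spectrum for a sample of n = 5 sequences.
unfolded = [4, 2, 1, 3]
folded = fold_sfs(unfolded)
print(folded)
```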

Article
Peer Reviewed

A novel spectral method for inferring general diploid selection from time series genetic data

UC Berkeley Previously Published Works (2014)

The increased availability of time series genetic variation data from experimental evolution studies and ancient DNA samples has created new opportunities to identify genomic regions under selective pressure and to estimate their associated fitness parameters. However, it is a challenging problem to compute the likelihood of non-neutral models for the population allele frequency dynamics, given the observed temporal DNA data. Here, we develop a novel spectral algorithm to analytically and efficiently integrate over all possible frequency trajectories between consecutive time points. This advance circumvents the limitations of existing methods which require fine-tuning the discretization of the population allele frequency space when numerically approximating requisite integrals. Furthermore, our method is flexible enough to handle general diploid models of selection where the heterozygote and homozygote fitness parameters can take any values, while previous methods focused on only a few restricted models of selection. We demonstrate the utility of our method on simulated data and also apply it to analyze ancient DNA data from genetic loci associated with coat coloration in horses. In contrast to previous studies, our exploration of the full fitness parameter space reveals that a heterozygote-advantage form of balancing selection may have been acting on these loci.

Thesis
Peer Reviewed

Scalable Machine Learning Algorithms for Biological Sequence Data

Chan, Jeffrey D
Advisor(s): Song, Yun S

UC Berkeley Electronic Theses and Dissertations (2021)

Recent advances in sequencing and synthesis technologies have sparked extraordinary growth in large-scale biological experimentation and data collection. This explosive growth necessitates the development of scalable yet accurate methods to investigate increasingly complex biological questions. Machine learning has become a vital tool for addressing the needs of computational biology, blending complex statistical models with efficient computation to uncover the underpinnings of biology.

In this dissertation, I develop three novel machine learning algorithms tailored towards biological sequence data to aid in answering such biological questions. The first method is a general-purpose statistical framework for inference of population genetic parameters. Previous methods focused on developing model approximation methods for a restricted class of models or reducing datasets to a set of hand-crafted summary statistics and comparing them against simulated data. Our framework uses an exchangeable neural network, which respects the permutation-invariant symmetries of the data, to learn the mapping from simulated datasets to the population genetic parameters of interest.
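
The idea of permutation invariance can be illustrated with a very small Python sketch: apply the same feature map to every haplotype (row) and then pool with a symmetric function such as the mean, so that reordering the rows cannot change the output. This is a generic illustration under assumed names, weights, and data; it is not the network architecture described in the dissertation.

```python
def relu_features(row, weights):
    """Apply one shared layer (linear map followed by ReLU) to a single haplotype."""
    return [max(0.0, sum(w * x for w, x in zip(unit, row))) for unit in weights]


def invariant_summary(rows, weights):
    """Featurize each row with shared weights, then average over rows."""
    features = [relu_features(row, weights) for row in rows]
    # Mean pooling is symmetric in its arguments, hence permutation-invariant.
    return [sum(column) / len(rows) for column in zip(*features)]


# Hypothetical shared weights (2 units, 3 sites) and a toy sample of 3 haplotypes.
weights = [[1.0, -1.0, 0.5], [0.5, 0.5, -1.0]]
haplotypes = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]

summary = invariant_summary(haplotypes, weights)
reordered = invariant_summary(haplotypes[::-1], weights)
print(summary, reordered)  # the two summaries agree
```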

The second method extends the ideas from the first method to a more challenging setting where segmentation of the genotypes is necessary to determine tracts of archaic admixture. In this setting, the data are permutation-equivariant, requiring a neural network architecture that results in accurate segmentation of archaic admixture tracts.

Finally, the third method focuses on the problem of search in protein engineering to discover high-fitness protein sequences of interest. Standard bandit optimization methods often focus on experimental feedback that is purely sequential. In protein engineering, advances in high-throughput synthesis and experimentation can lead to batches of size up to 10^5, where the batch size is often much larger than the number of rounds of experimentation. We propose a family of parallel contextual linear bandit algorithms and analyze their regret bounds.

Thesis
Peer Reviewed

Statistical Methods and Analyses in Computational Genomics: Explorations of Eukaryotic Transcription

Fischer, Jonathan Robert
Advisor(s): Song, Yun S

UC Berkeley Electronic Theses and Dissertations (2018)

The introduction of next-generation, or high-throughput, sequencing techniques has fundamentally altered our perception of the genome and transcriptome by permitting the simultaneous study of tens of thousands of distinct transcripts. In recent years, the popularity of next-generation sequencing has risen due to reductions in costs and the steady accumulation of novel genetic and genomic discoveries which would have proven difficult to uncover with older approaches. The continued proliferation of these techniques, both in number and in frequency of use, has resulted in unique data types and experimental structures which require analysis and, frequently, methodological development.

In this dissertation, I explore eukaryotic transcription from multiple perspectives by applying both classical and novel statistical methods to data generated by different next-generation sequencing protocols. I begin by constructing spatial nascent transcription profiles based on various RNA Polymerase II footprinting procedures to demonstrate the profound deleterious effect of RNA transcript decay factor deletion on mRNA production in yeast, bolstering and expanding upon prior evidence of the inextricable link between RNA synthesis and decay. My focus then shifts to the development and application of a tensor-based method to RNA-seq data of both the bulk and single-cell varieties. This method is intended for use with data produced in experiments with increasingly common, specially structured designs in which samples share tissues and/or individuals, and I show that it more robustly and powerfully characterizes the transcriptome via simulation and application to human bulk gene expression measurements. I conclude by employing this method jointly with traditional approaches to investigate the tissue-specific effects on gene expression as measured in murine single-cell RNA-seq and discuss the merits of tensor methods in such a setting.
