Scholarly Works (13 results)

Thesis
Peer Reviewed

Efficient Algorithms for Human Genetic Variation Detection using High-throughput Sequencing Techniques

He, Dan
Advisor(s): Eskin, Eleazar

UCLA Electronic Theses and Dissertations (2012)

High-throughput sequencing (HTS) technologies are a class of genome sequencing techniques in which short DNA segments, or reads, are sequenced or sampled from the genome. Compared with traditional genome sequencing techniques, they offer advantages such as low cost and the ability to parallelize the sequencing process to produce millions of reads. These technologies have been widely applied to many important problems related to human genetic variation. We target three human genetic variation problems using the reads generated by HTS.

It is well known that human individuals differ from each other by 0.1%. The majority of the differences are in the form of SNPs, or Single Nucleotide Polymorphisms. Haplotypes, defined as the sequences of SNPs on each chromosome of a human genome, are important for problems such as imputation of genetic variants and inferring the relatedness of human individuals. A difficulty in haplotype inference is the presence of sequencing errors, and a natural formulation of the problem is to infer the haplotypes that are most consistent with the data from a combinatorial perspective. Unfortunately, this formulation of the haplotype assembly problem is known to be NP-hard. We proposed several techniques, including dynamic programming, MaxSAT and a Hidden Markov Model (HMM), to solve the problem optimally from different perspectives.

Structural variations, and in particular Copy Number Variations (CNVs), have dramatic effects on disease and traits. We first proposed an efficient algorithm to detect and reconstruct CNVs in unique genomic regions, where the sequencing reads generated from HTS are mapped to a reference genome and signatures indicating the presence of a CNV are identified. We then extended the algorithm to a much more challenging problem in which CNVs lie in repeat-rich regions and the reads may map to multiple positions. To our knowledge, our method is the first attempt to both identify and reconstruct CNVs in repeat-rich regions.

Recent advances in sequencing technologies set the stage for large population based studies, in which the DNA or RNA of thousands of individuals will be sequenced. A few multiplexing schemes have been suggested, in which a small number of DNA pools are sequenced, and the results are then deconvoluted using compressed sensing or similar approaches. These methods, however, are limited to the detection of rare variants. We provide a new algorithm for the deconvolution of DNA pools multiplexing schemes. The presented algorithm utilizes a likelihood model and linear programming and is able to genotype both low and high allele frequency SNPs with microarray genotyping and imputation.


Article
Peer Reviewed

Bioinspired Thiophosphorodichloridate Reagents for Chemoselective Histidine Bioconjugation

UC Berkeley Previously Published Works (2019)

Site-selective bioconjugation to native protein residues is a powerful tool for protein functionalization, with cysteine and lysine side chains being the most common points for attachment owing to their high nucleophilicity. We now report a strategy for histidine modification using thiophosphorodichloridate reagents that mimic post-translational histidine phosphorylation, enabling fast and selective labeling of protein histidines under mild conditions where various payloads can be introduced via copper-assisted alkyne-azide cycloaddition (CuAAC) chemistry. We establish that these reagents are particularly effective at covalent modification of His-tags, which are common motifs to facilitate protein purification, as illustrated by selective attachment of polyarginine cargoes to enhance the uptake of proteins into living cells. This work provides a starting point for probing and enhancing protein function using histidine-directed chemistry.


Article
Peer Reviewed

Detection and reconstruction of tandemly organized de novo copy number variations

UCLA Previously Published Works (2010)

Background

The characterization of structural variations (SV) such as insertions, deletions and copy number variations is a critical step in the process of understanding the full genetic architecture of organisms. Copy number variations (CNV) have attracted much recent attention due to their effects on gene expression and disease status.

Results

In this paper, we present a method that utilizes next-generation sequencing technologies (NGS), in order to both detect and reconstruct CNVs. We focus on a special type of CNV, namely tandemly organized de novo CNVs, which have been shown to occur with high frequency in the mouse genome.

Conclusions

We apply our method to CNV regions randomly inserted into the reference mouse genome and show that our method achieves good performance for both detection and reconstruction of tandemly organized de novo CNVs.


Article
Peer Reviewed

Optimal algorithms for haplotype assembly from whole-genome sequence data

UCLA Previously Published Works (2010)

Motivation

Haplotype inference is an important step for many types of analyses of genetic variation in the human genome. Traditional approaches for obtaining haplotypes involve collecting genotype information from a population of individuals and then applying a haplotype inference algorithm. The development of high-throughput sequencing technologies allows for an alternative strategy to obtain haplotypes by combining sequence fragments. The problem of 'haplotype assembly' is the problem of assembling the two haplotypes for a chromosome given the collection of such fragments, or reads, and their locations in the haplotypes, which are pre-determined by mapping the reads to a reference genome. Errors in reads significantly increase the difficulty of the problem and it has been shown that the problem is NP-hard even for reads of length 2. Existing greedy and stochastic algorithms are not guaranteed to find the optimal solutions for the haplotype assembly problem.

Results

In this article, we propose a dynamic programming algorithm that is able to assemble the haplotypes optimally with time complexity O(m × 2^k × n), where m is the number of reads, k is the length of the longest read and n is the total number of SNPs in the haplotypes. We also reduce the haplotype assembly problem to the maximum satisfiability problem, which can often be solved optimally even when k is large. Taking advantage of the efficiency of our algorithm, we perform simulation experiments demonstrating that the assembly of haplotypes using reads of length typical of the current sequencing technologies is not practical. However, we demonstrate that the combination of this approach and the traditional haplotype phasing approaches allows us to practically construct haplotypes containing both common and rare variants.
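To make the problem statement concrete, the following is a minimal Python sketch of haplotype assembly as an error-minimization problem. It assumes the commonly used minimum error correction objective (flip as few read alleles as possible so that every read agrees with one of two complementary haplotypes), which the abstract does not name explicitly. It uses exhaustive search over all haplotypes rather than the dynamic programming or MaxSAT methods described above, so it is only practical for a handful of SNPs. The names `mec_cost` and `brute_force_assembly` and the toy reads are hypothetical.

```python
from itertools import product


def mec_cost(reads, hap):
    """Number of allele mismatches when each read is assigned to the
    closer of `hap` and its complement (assumed MEC-style objective)."""
    comp = [1 - a for a in hap]
    total = 0
    for read in reads:
        # Each read maps SNP index -> observed allele (0 or 1).
        d_hap = sum(1 for i, a in read.items() if hap[i] != a)
        d_comp = sum(1 for i, a in read.items() if comp[i] != a)
        total += min(d_hap, d_comp)
    return total


def brute_force_assembly(reads, n_snps):
    """Exhaustively search haplotypes, fixing hap[0] = 0 because a
    haplotype and its complement describe the same pair."""
    best_hap, best_cost = None, None
    for tail in product((0, 1), repeat=n_snps - 1):
        hap = (0,) + tail
        cost = mec_cost(reads, hap)
        if best_cost is None or cost < best_cost:
            best_hap, best_cost = hap, cost
    return best_hap, best_cost


# Toy reads over 4 heterozygous SNPs (illustrative data only).
reads = [
    {0: 0, 1: 1},
    {1: 1, 2: 0},
    {0: 1, 1: 0},
    {1: 0, 2: 1},
    {2: 0, 3: 0},
    {2: 1, 3: 1},
    {2: 0, 3: 1},  # inconsistent with the others: one allele must be flipped
]
hap, cost = brute_force_assembly(reads, 4)
print(hap, cost)
```

The search visits 2^(n-1) candidate haplotypes, which is why approaches whose exponential term depends on the read length k rather than on the number of SNPs n are attractive when reads are short.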


Article
Peer Reviewed

Genotyping common and rare variation using overlapping pool sequencing

UCLA Previously Published Works (2011)

Background

Recent advances in sequencing technologies set the stage for large, population-based studies, in which the DNA or RNA of thousands of individuals will be sequenced. Currently, however, such studies are still infeasible using a straightforward sequencing approach; as a result, a few multiplexing schemes have recently been suggested, in which a small number of DNA pools are sequenced, and the results are then deconvoluted using compressed sensing or similar approaches. These methods, however, are limited to the detection of rare variants.

Results

In this paper we provide a new algorithm for the deconvolution of DNA pools multiplexing schemes. The presented algorithm utilizes a likelihood model and linear programming. The approach allows for the addition of external data, particularly imputation data, resulting in a flexible environment that is suitable for different applications.

Conclusions

Particularly, we demonstrate that both low and high allele frequency SNPs can be accurately genotyped when the DNA pooling scheme is performed in conjunction with microarray genotyping and imputation. Additionally, we demonstrate the use of our framework for the detection of cancer fusion genes from RNA sequences.


Article
Peer Reviewed

IPED: Inheritance Path-based Pedigree Reconstruction Algorithm Using Genotype Data

UCLA Previously Published Works (2013)

The problem of inference of family trees, or pedigree reconstruction, for a group of individuals is a fundamental problem in genetics. Various methods have been proposed to automate the process of pedigree reconstruction given the genotypes or haplotypes of a set of individuals. Current methods, unfortunately, are very time-consuming and inaccurate for complicated pedigrees, such as pedigrees with inbreeding. In this work, we propose an efficient algorithm that is able to reconstruct large pedigrees with reasonable accuracy. Our algorithm reconstructs the pedigrees generation by generation, backward in time from the extant generation. We predict the relationships between individuals in the same generation using an inheritance path-based approach implemented with an efficient dynamic programming algorithm. Experiments show that our algorithm runs in linear time with respect to the number of reconstructed generations, and therefore, it can reconstruct pedigrees that have a large number of generations. Indeed it is the first practical method for reconstruction of large pedigrees from genotype data.

Article
Peer Reviewed

Accounting for Population Structure in Gene-by-Environment Interactions in Genome-Wide Association Studies Using Mixed Models

UCLA Previously Published Works (2016)

Although genome-wide association studies (GWASs) have discovered numerous novel genetic variants associated with many complex traits and diseases, those genetic variants typically explain only a small fraction of phenotypic variance. Factors that account for phenotypic variance include environmental factors and gene-by-environment interactions (GEIs). Recently, several studies have conducted genome-wide gene-by-environment association analyses and demonstrated important roles of GEIs in complex traits. One of the main challenges in these association studies is to control the effects of population structure, which may cause spurious associations. Many studies have analyzed how population structure influences statistics of genetic variants and developed several statistical approaches to correct for population structure. However, the impact of population structure on GEI statistics in GWASs has not been extensively studied, nor have there been methods designed to correct for population structure on GEI statistics. In this paper, we show both analytically and empirically that population structure may cause spurious GEIs and use both simulation and two GWAS datasets to support our finding. We propose a statistical approach based on mixed models to account for population structure on GEI statistics. We find that our approach effectively controls population structure on statistics for GEIs as well as for genetic variants.


Article
Peer Reviewed

Distinct RNA N-demethylation pathways catalyzed by nonheme iron ALKBH5 and FTO enzymes enable regulation of formaldehyde release rates

UC Berkeley Previously Published Works (2020)

The AlkB family of nonheme Fe(II)/2-oxoglutarate-dependent oxygenases are essential regulators of RNA epigenetics by serving as erasers of one-carbon marks on RNA with release of formaldehyde (FA). Two major human AlkB family members, FTO and ALKBH5, both act as oxidative demethylases of N6-methyladenosine (m6A) but furnish different major products, N6-hydroxymethyladenosine (hm6A) and adenosine (A), respectively. Here we identify foundational mechanistic differences between FTO and ALKBH5 that promote these distinct biochemical outcomes. In contrast to FTO, which follows a traditional oxidative N-demethylation pathway to catalyze conversion of m6A to hm6A with subsequent slow release of A and FA, we find that ALKBH5 catalyzes a direct m6A-to-A transformation with rapid FA release. We identify a catalytic R130/K132/Y139 triad within ALKBH5 that facilitates release of FA via an unprecedented covalent-based demethylation mechanism with direct detection of a covalent intermediate. Importantly, a K132Q mutant furnishes an ALKBH5 enzyme with an m6A demethylation profile that resembles that of FTO, establishing the importance of this residue in the proposed covalent mechanism. Finally, we show that ALKBH5 is an endogenous source of FA in the cell by activity-based sensing of FA fluxes perturbed via ALKBH5 knockdown. This work provides a fundamental biochemical rationale for nonredundant roles of these RNA demethylases beyond different substrate preferences and cellular localization, where m6A demethylation by ALKBH5 versus FTO results in release of FA, an endogenous one-carbon unit but potential genotoxin, at different rates in living systems.


Article
Peer Reviewed

Identifying genetic relatives without compromising privacy

UCLA Previously Published Works (2014)

The development of high-throughput genomic technologies has impacted many areas of genetic research. While many applications of these technologies focus on the discovery of genes involved in disease from population samples, applications of genomic technologies to an individual's genome or personal genomics have recently gained much interest. One such application is the identification of relatives from genetic data. In this application, genetic information from a set of individuals is collected in a database, and each pair of individuals is compared in order to identify genetic relatives. An inherent issue that arises in the identification of relatives is privacy. In this article, we propose a method for identifying genetic relatives without compromising privacy by taking advantage of novel cryptographic techniques customized for secure and private comparison of genetic information. We demonstrate the utility of these techniques by allowing a pair of individuals to discover whether or not they are related without compromising their genetic information or revealing it to a third party. The idea is that individuals only share enough special-purpose cryptographically protected information with each other to identify whether or not they are relatives, but not enough to expose any information about their genomes. We show in HapMap and 1000 Genomes data that our method can recover first- and second-order genetic relationships and, through simulations, show that our method can identify relationships as distant as third cousins while preserving privacy.


Article
Peer Reviewed

Effects of PDE4 pathway inhibition in rat experimental stroke.

UC Irvine Previously Published Works (2014)

Purpose

The first genomewide association study indicated that variations in the phosphodiesterase 4D (PDE4D) gene confer risk for ischemic stroke. However, inconsistencies among the studies designed to replicate the findings indicated the need for further investigation to elucidate the role of the PDE4 pathway in stroke pathogenesis. Hence, we studied the effect of global inhibition of the PDE4 pathway in two rat experimental stroke models, using the PDE4 inhibitor rolipram. Further, the specific role of the PDE4D isoform in ischemic stroke pathogenesis was studied using PDE4D knockout rats in experimental stroke.

Methods

Rats were subjected to either the ligation or embolic stroke model and treated with rolipram (3 mg/kg; i.p.) prior to the ischemic insult. Similarly, the PDE4D knockout rats were subjected to experimental stroke using the embolic model.

Results

Global inhibition of the PDE4 pathway using rolipram produced infarcts that were 225% (p<0.01) and 138% (p<0.05) of control in the ligation and embolic models, respectively. PDE4D knockout rats subjected to embolic stroke showed no change in infarct size compared to wild-type control.

Conclusions

Despite the increase in infarct size after global inhibition of the PDE4 pathway with rolipram, specific inhibition of the PDE4D isoform had no effect on experimental stroke. These findings support a role for the PDE4 pathway, independent of the PDE4D isoform, in ischemic stroke pathogenesis.
