Article
Peer Reviewed

OMMA enables population-scale analysis of complex genomic features and phylogenomic relationships from nanochannel-based optical maps

UC San Francisco Previously Published Works (2019)

Background

Optical mapping is an emerging technology that complements sequencing-based methods in genome analysis. It is widely used in improving genome assemblies and detecting structural variations by providing information over much longer (up to 1 Mb) reads. Current standards in optical mapping analysis involve assembling optical maps into contigs and aligning them to a reference, which is limited to pairwise comparison and becomes bias-prone when analyzing multiple samples.

Findings

We present a new method, OMMA, that extends optical mapping to the study of complex genomic features by simultaneously interrogating optical maps across many samples in a reference-independent manner. Applied to 154 human samples across the 26 populations sequenced in the 1000 Genomes Project, OMMA captures and characterizes complex genomic features such as multiple haplotypes, copy number variations, and subtelomeric structures. For small genomes such as pathogenic bacteria, OMMA accurately reconstructs the phylogenomic relationships and identifies functional elements across 21 Acinetobacter baumannii strains.

Conclusions

With the increasing data throughput of optical mapping systems, the use of this technology in comparative genome analysis across many samples will become feasible. OMMA is a timely solution that addresses this computational need. The OMMA software is available at https://github.com/TF-Chan-Lab/OMTools.

Article
Peer Reviewed

Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors

UC Berkeley Previously Published Works (2012)

Background

Transcription factors function by binding different classes of regulatory elements. The Encyclopedia of DNA Elements (ENCODE) project has recently produced binding data for more than 100 transcription factors from about 500 ChIP-seq experiments in multiple cell types. While this large amount of data creates a valuable resource, it is nonetheless overwhelmingly complex and simultaneously incomplete since it covers only a small fraction of all human transcription factors.

Results

As part of the consortium effort in providing a concise abstraction of the data for facilitating various types of downstream analyses, we constructed statistical models that capture the genomic features of three paired types of regions by machine-learning methods: firstly, regions with active or inactive binding; secondly, those with extremely high or low degrees of co-binding, termed HOT and LOT regions; and finally, regulatory modules proximal or distal to genes. From the distal regulatory modules, we developed computational pipelines to identify potential enhancers, many of which were validated experimentally. We further associated the predicted enhancers with potential target transcripts and the transcription factors involved. For HOT regions, we found a significant fraction of transcription factor binding without clear sequence motifs and showed that this observation could be related to strong DNA accessibility of these regions.

Conclusions

Overall, the three pairs of regions exhibit intricate differences in chromosomal locations, chromatin features, factors that bind them, and cell-type specificity. Our machine learning approach enables us to identify features potentially general to all transcription factors, including those not included in the data.

Article
Peer Reviewed

OMBlast: alignment tool for optical mapping using a seed-and-extend approach

UC San Francisco Previously Published Works (2017)

Background

Optical mapping is a technique for capturing fluorescent signal patterns of long DNA molecules (in the range of 0.1–1 Mbp). Recently, it has been complementing the widely used short-read sequencing technology by assisting with scaffolding and detecting large and complex structural variations (SVs). Here, we introduce a fast, robust and accurate tool called OMBlast for aligning optical maps, the set of signal locations on the molecules generated from optical mapping. Our method is based on the seed-and-extend approach from sequence alignment, with modifications specific to optical mapping.
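As a rough illustration of what seed-and-extend can look like on optical maps, the sketch below treats each map as a sorted list of label positions, converts it to inter-label segment lengths, finds short runs of matching segments (seeds), and grows each seed in both directions. This is not OMBlast's actual algorithm: the function names (`segments`, `find_seeds`, `extend`, `align`), the seed length `k`, and the relative tolerance `tol` are all assumptions made for the example, and the sketch ignores missing or extra labels and scaling errors that a real aligner must handle.

```python
from typing import Iterator, List, Optional, Tuple


def segments(labels: List[float]) -> List[float]:
    """Convert sorted label positions (bp) into inter-label distances."""
    return [b - a for a, b in zip(labels, labels[1:])]


def similar(a: float, b: float, tol: float) -> bool:
    """True if two segment lengths agree within a relative tolerance."""
    return abs(a - b) <= tol * max(a, b)


def find_seeds(q: List[float], r: List[float], k: int,
               tol: float) -> Iterator[Tuple[int, int]]:
    """Yield (query index, reference index) where k consecutive segments agree."""
    for qi in range(len(q) - k + 1):
        for ri in range(len(r) - k + 1):
            if all(similar(q[qi + j], r[ri + j], tol) for j in range(k)):
                yield qi, ri


def extend(q: List[float], r: List[float], qi: int, ri: int, k: int,
           tol: float) -> Tuple[int, int, int]:
    """Grow a seed left and right while neighbouring segments keep agreeing."""
    lo_q, lo_r = qi, ri
    while lo_q > 0 and lo_r > 0 and similar(q[lo_q - 1], r[lo_r - 1], tol):
        lo_q, lo_r = lo_q - 1, lo_r - 1
    hi_q, hi_r = qi + k, ri + k
    while hi_q < len(q) and hi_r < len(r) and similar(q[hi_q], r[hi_r], tol):
        hi_q, hi_r = hi_q + 1, hi_r + 1
    # (query start segment, reference start segment, number of matched segments)
    return lo_q, lo_r, hi_q - lo_q


def align(query_labels: List[float], ref_labels: List[float],
          k: int = 3, tol: float = 0.05) -> Optional[Tuple[int, int, int]]:
    """Return the longest extended seed, or None if no seed is found."""
    q, r = segments(query_labels), segments(ref_labels)
    best = None
    for qi, ri in find_seeds(q, r, k, tol):
        hit = extend(q, r, qi, ri, k, tol)
        if best is None or hit[2] > best[2]:
            best = hit
    return best
```

Working on segment lengths rather than absolute positions makes the comparison independent of where a molecule starts, which is why the seed is defined as a run of consecutive matching segments in this sketch.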

Results

Experiments with both synthetic data and our own real data demonstrate that OMBlast has higher accuracy and faster mapping speed than existing alignment methods. Our tool also shows significant improvement when aligning data with SVs.

Availability and implementation

OMBlast is implemented for Java 1.7 and is released under a GPL license. OMBlast can be downloaded from https://github.com/aldenleung/OMBlast and run directly on machines equipped with a Java virtual machine.

Contact

kevinyip@cse.cuhk.edu.hk and tf.chan@cuhk.edu.hk

Supplementary information

Supplementary data are available at Bioinformatics online.

Article
Peer Reviewed

OMSV enables accurate and comprehensive identification of large structural variations from nanochannel-based single-molecule optical maps

UC San Francisco Previously Published Works (2017)

We present a new method, OMSV, for accurately and comprehensively identifying structural variations (SVs) from optical maps. OMSV detects both homozygous and heterozygous SVs, SVs of various types and sizes, and SVs with or without creating or destroying restriction sites. We show that OMSV has high sensitivity and specificity, with clear performance gains over the latest method. Applying OMSV to a human cell line, we identified hundreds of SVs >2 kbp, with 68% of them missed by sequencing-based callers. Independent experimental validation confirmed the high accuracy of these SVs. The OMSV software is available at http://yiplab.cse.cuhk.edu.hk/omsv/.

Article
Peer Reviewed

Genome maps across 26 human populations reveal population-specific patterns of structural variation

UC San Francisco Previously Published Works (2019)

Large structural variants (SVs) in the human genome are difficult to detect and study by conventional sequencing technologies. With long-range genome analysis platforms, such as optical mapping, one can identify large SVs (>2 kb) across the genome in one experiment. Analyzing optical genome maps of 154 individuals from the 26 populations sequenced in the 1000 Genomes Project, we find that phylogenetic population patterns of large SVs are similar to those of single nucleotide variations in 86% of the human genome, while ~2% of the genome has high structural complexity. We are able to characterize SVs in many intractable regions of the genome, including segmental duplications and subtelomeric, pericentromeric, and acrocentric areas. In addition, we discover ~60 Mb of non-redundant genome content missing in the reference genome sequence assembly. Our results highlight the need for a comprehensive set of alternate haplotypes from different populations to represent SV patterns in the genome.

Article
Peer Reviewed

Supervised enhancer prediction with epigenetic pattern recognition and targeted validation

UC Merced Previously Published Works (2020)

Enhancers are important non-coding elements, but they have traditionally been hard to characterize experimentally. The development of massively parallel assays allows the characterization of large numbers of enhancers for the first time. Here, we developed a framework using Drosophila STARR-seq to create shape-matching filters based on meta-profiles of epigenetic features. We integrated these features with supervised machine-learning algorithms to predict enhancers. We further demonstrated that our model could be transferred to predict enhancers in mammals. We comprehensively validated the predictions using a combination of in vivo and in vitro approaches, involving transgenic assays in mice and transduction-based reporter assays in human cell lines (153 enhancers in total). The results confirmed that our model can accurately predict enhancers in different species without re-parameterization. Finally, we examined the transcription factor binding patterns at predicted enhancers versus promoters. We demonstrated that these patterns enable the construction of a secondary model that effectively distinguishes enhancers and promoters.

Article
Peer Reviewed

Genome-Wide Structural Variation Detection by Genome Mapping on Nanochannel Arrays

UC San Francisco Previously Published Works (2016)

Comprehensive whole-genome structural variation detection is challenging with current approaches. With diploid cells as DNA source and the presence of numerous repetitive elements, short-read DNA sequencing cannot be used to detect structural variation efficiently. In this report, we show that genome mapping with long, fluorescently labeled DNA molecules imaged on nanochannel arrays can be used for whole-genome structural variation detection without sequencing. While whole-genome haplotyping is not achieved, local phasing (across >150-kb regions) is routine, as molecules from the parental chromosomes are examined separately. In one experiment, we generated genome maps from a trio from the 1000 Genomes Project, compared the maps against that derived from the reference human genome, and identified structural variations that are >5 kb in size. We find that these individuals have many more structural variants than those published, including some with the potential of disrupting gene function or regulation.

Article
Peer Reviewed

Establishment and characterization of new tumor xenografts and cancer cell lines from EBV-positive nasopharyngeal carcinoma

UC Davis Previously Published Works (2018)

The lack of representative nasopharyngeal carcinoma (NPC) models has seriously hampered research on EBV carcinogenesis and preclinical studies in NPC. Here we report the successful growth of five NPC patient-derived xenografts (PDXs) from fifty-eight attempts at transplanting NPC specimens into NOD/SCID mice. The take rates for primary and recurrent NPC are 4.9% and 17.6%, respectively. A new EBV-positive NPC cell line, NPC43, was successfully established directly from patient NPC tissues by including a Rho-associated coiled-coil containing kinase inhibitor (Y-27632) in the culture medium. Spontaneous lytic reactivation of EBV can be observed in NPC43 upon withdrawal of Y-27632. Whole-exome sequencing (WES) reveals a close similarity between the mutational profiles of these NPC PDXs and those of their corresponding patient NPCs. Whole-genome sequencing (WGS) further delineates the genomic landscape and sequences of EBV genomes in these newly established NPC models, which supports their potential use in future studies of NPC.

Article
Peer Reviewed

Exome and genome sequencing of nasopharynx cancer identifies NF-κB pathway activating mutations

UC San Francisco Previously Published Works (2017)

Nasopharyngeal carcinoma (NPC) is an aggressive head and neck cancer characterized by Epstein-Barr virus (EBV) infection and dense lymphocyte infiltration. The scarcity of NPC genomic data hinders the understanding of NPC biology, disease progression and rational therapy design. Here we performed whole-exome sequencing (WES) on 111 micro-dissected EBV-positive NPCs, with 15 cases subjected to further whole-genome sequencing (WGS), to determine its mutational landscape. We identified enrichment for genomic aberrations of multiple negative regulators of the NF-κB pathway, including CYLD, TRAF3, NFKBIA and NLRC5, in a total of 41% of cases. Functional analysis confirmed inactivating CYLD mutations as drivers for NPC cell growth. The EBV oncoprotein latent membrane protein 1 (LMP1) functions to constitutively activate NF-κB signalling, and we observed mutual exclusivity among tumours with somatic NF-κB pathway aberrations and LMP1-overexpression, suggesting that NF-κB activation is selected for by both somatic and viral events during NPC pathogenesis.
