Scholarly Works (72 results)

Article
Peer Reviewed

PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

UCLA Previously Published Works (2021)

To investigate molecular mechanisms underlying cell state changes, a crucial analysis is to identify differentially expressed (DE) genes along the pseudotime inferred from single-cell RNA-sequencing data. However, existing methods do not account for pseudotime inference uncertainty, and they have either ill-posed p-values or restrictive models. Here we propose PseudotimeDE, a DE gene identification method that adapts to various pseudotime inference methods, accounts for pseudotime inference uncertainty, and outputs well-calibrated p-values. Comprehensive simulations and real-data applications verify that PseudotimeDE outperforms existing methods in false discovery rate control and power.

Article
Peer Reviewed

Statistical hypothesis testing versus machine-learning binary classification: distinctions and guidelines.

Research Reports (2020)

Article
Peer Reviewed

Erratum to: ‘NMFP: a non-negative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNA-seq data’

UCLA Previously Published Works (2016)

Creative Commons 'BY-NC' version 4.0 license

Article
Peer Reviewed

Correspondence of D. melanogaster and C. elegans developmental stages revealed by alternative splicing characteristics of conserved exons

UCLA Previously Published Works (2017)

Background

We report a statistical study to find correspondence of D. melanogaster and C. elegans developmental stages based on alternative splicing (AS) characteristics of conserved cassette exons using modENCODE RNA-seq data. We identify "stage-associated exons" to capture the AS characteristics of each stage and use these exons to map pairwise stages within and between the two species by an overlap test.

Results

Within fly and within worm, adjacent developmental stages are mapped to each other, i.e., a strong diagonal pattern is observed as expected, supporting the validity of our approach. Between fly and worm, two parallel mapping patterns are observed: one between fly early embryos through early larvae and the worm life cycle, and another between fly late larvae through adults and worm late embryos through adults. We also apply this approach to compare tissues and cells from fly and worm. Findings include the high similarity between fly/worm adults and fly/worm embryos, groupings of fly cell lines, and strong mappings of fly head tissues to worm late embryos and male adults. Gene ontology and KEGG enrichment analyses provide a detailed functional annotation of the identified stage-associated exons, as well as a functional explanation of the observed correspondence map between fly and worm developmental stages.

Conclusions

Our results suggest that AS dynamics of the exon pairs that share similar DNA sequences are informative for finding transcriptomic similarity of biological samples. Our study is innovative in two aspects. First, to our knowledge, our study is the first comprehensive study of AS events in fly and worm developmental stages, tissues, and cells. AS events provide an alternative perspective of transcriptome dynamics, compared to gene expression events. Second, our results do not entirely rely on the information of orthologous genes. Interesting results are also observed for fly and worm cassette exon pairs with DNA sequence similarity but not in orthologous gene pairs.
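
As an illustration of the kind of overlap test mentioned in the Background above, the following is a minimal sketch that assumes the test is a one-sided hypergeometric test on two exon sets drawn from a shared pool. The abstract does not specify the test statistic, so the function name `overlap_pvalue` and its parameters are hypothetical and not taken from the paper.

```python
from math import comb


def overlap_pvalue(n_total, n_a, n_b, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    Assumption (not from the paper): set A has n_a exons, set B has n_b
    exons, both are subsets of a pool of n_total exons, and under the null
    B is a uniformly random subset of the pool.
    """
    upper = min(n_a, n_b)
    denom = comb(n_total, n_b)
    # Sum the probabilities of observing an overlap of k = n_overlap..upper.
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / denom


# Hypothetical numbers: 10 conserved exons, two stages with 3
# stage-associated exons each, all 3 shared.
p = overlap_pvalue(10, 3, 3, 3)
```

A small p-value for a pair of stages would suggest that their stage-associated exon sets overlap more than expected by chance, which is the sense in which two stages are "mapped" to each other under this assumed test.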

Thesis
Peer Reviewed

Systematic Identification and Analysis of Cell-state-associated cisregulatory Elements Using Statistical Approaches

Yang, Yucheng
Advisor(s): Li, Jingyi Jessica

UCLA Electronic Theses and Dissertations (2017)

Recent genome-wide studies have significantly advanced our understanding of the non-coding genome in higher eukaryotes. Here we developed a novel computational method to systematically identify cell-state-associated cis-regulatory elements for more than 300 cell and tissue types from human and mouse. Our method identified strong enrichment of associated enhancers with immune cells. We found that the cis-regulatory elements associated with more cell and tissue types exhibit certain genomic features, including longer length, higher conservation score and enrichment of CpG-islands. We identified enriched transcription factor (TF) motifs within the enhancers associated with each cell and tissue type. We also found that the single nucleotide polymorphisms (SNPs) identified by the Genome-Wide Association Study (GWAS) are particularly enriched in the cell-state-associated enhancers. Furthermore, we analyzed the association between human diseases and various cell and tissue types, and found that sclerosis diseases are associated with diverse immune-associated tissues and mature immune cells. Finally, we estimated enhancer-promoter signal correlations and identified enhancers exhibiting conserved correlations between human and mouse.

Article
Peer Reviewed

NMFP: a non-negative matrix factorization based preselection method to increase accuracy of identifying mRNA isoforms from RNA-seq data

UCLA Previously Published Works (2016)

Background

The advent of next-generation RNA sequencing (RNA-seq) has greatly advanced transcriptomic studies, including system-wide identification and quantification of mRNA isoforms under various biological conditions. A number of computational methods have been developed to systematically identify mRNA isoforms in a high-throughput manner from RNA-seq data. However, a common drawback of these methods is that their identified mRNA isoforms contain a high percentage of false positives, especially for genes with complex splicing structures, e.g., many exons and exon junctions.

Results

We have developed a preselection method called "Non-negative Matrix Factorization Preselection" (NMFP), which is designed to improve the accuracy of computational methods in identifying mRNA isoforms from RNA-seq data. We demonstrated through simulation and real data studies that NMFP can effectively shrink the search space of isoform candidates and increase the accuracy of two mainstream computational methods, Cufflinks and SLIDE, in their identification of mRNA isoforms.

Conclusion

NMFP is a useful tool to preselect mRNA isoform candidates for downstream isoform discovery methods. It can greatly reduce the number of isoform candidates while maintaining a good coverage of unknown true isoforms. With NMFP added as an upstream step, computational methods are expected to achieve better accuracy in identifying mRNA isoforms from RNA-seq data.

Article
Peer Reviewed

Statistical Hypothesis Testing versus Machine Learning Binary Classification: Distinctions and Guidelines

UCLA Previously Published Works (2020)

Making binary decisions is a common data analytical task in scientific research and industrial applications. In data sciences, there are two related but distinct strategies: hypothesis testing and binary classification. In practice, how to choose between these two strategies can be unclear and rather confusing. Here, we summarize key distinctions between these two strategies in three aspects and list five practical guidelines for data analysts to choose the appropriate strategy for specific analysis needs. We demonstrate the use of those guidelines in a cancer driver gene prediction example.

Article
Peer Reviewed

Issues arising from benchmarking single-cell RNA sequencing imputation methods

UCLA Previously Published Works (2019)

On June 25, 2018, Huang et al. published a computational method, SAVER, in Nature Methods for imputing dropout gene expression levels in single-cell RNA sequencing (scRNA-seq) data. Huang et al. performed a set of comprehensive benchmarking analyses, including comparison with the data from RNA fluorescence in situ hybridization, to demonstrate that SAVER outperformed two existing scRNA-seq imputation methods, scImpute and MAGIC. However, their computational analyses were based on semi-synthetic data that the authors had generated following the Poisson-Gamma model used in the SAVER method. We have therefore re-examined Huang et al.'s study. We find that the semi-synthetic data have very different properties from those of real scRNA-seq data and that the cell clusters used for benchmarking are inconsistent with the cell types labeled by biologists. We show that a reanalysis based on real scRNA-seq data and grounded in biological knowledge of cell types leads to different results and conclusions from those of Huang et al.

Article
Peer Reviewed

A statistical simulator scDesign for rational scRNA-seq experimental design

UCLA Previously Published Works (2018)

Motivation

Single-cell RNA-sequencing (scRNA-seq) has revolutionized biological sciences by revealing genome-wide gene expression levels within individual cells. However, a critical challenge faced by researchers is how to optimize the choices of sequencing platforms, sequencing depths, and cell numbers in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information.

Results

Here we present a flexible and robust simulator, scDesign, the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings. In an evaluation based on 17 cell types and six different protocols, scDesign outperformed four state-of-the-art scRNA-seq simulation methods and led to rational experimental design. In addition, scDesign demonstrates reproducibility across biological replicates and independent studies. We also discuss the performance of multiple differential expression and dimension reduction methods based on the protocol-dependent scRNA-seq data generated by scDesign. scDesign is expected to be an effective bioinformatic tool that assists rational scRNA-seq experiment design based on specific research goals and compares various scRNA-seq computational methods.

Availability

We have implemented our method in the R package scDesign, which is freely available at https://github.com/Vivianstats/scDesign.

Contact

jli@stat.ucla.edu

Article
Peer Reviewed

A flexible model-free prediction-based framework for feature ranking

UCLA Previously Published Works (2019)

Despite the availability of numerous statistical and machine learning tools for joint feature modeling, many scientists investigate features marginally, i.e., one feature at a time. This is partly due to training and convention but is also rooted in scientists' strong interest in simple visualization and interpretability. As such, marginal feature ranking for some predictive tasks, e.g., prediction of cancer driver genes, is widely practiced in the process of scientific discovery. In this work, we focus on marginal ranking for binary prediction, arguably the most common predictive task. We argue that the most widely used marginal ranking criteria, including the Pearson correlation, the two-sample t test, and the two-sample Wilcoxon rank-sum test, do not fully take feature distributions and prediction objectives into account. To address this gap in practice, we propose two ranking criteria corresponding to two prediction objectives: the classical criterion (CC) and the Neyman-Pearson criterion (NPC), both of which use model-free nonparametric implementations to accommodate diverse feature distributions. Theoretically, we show that under regularity conditions both criteria achieve sample-level ranking consistent with their population-level counterpart with high probability. Moreover, NPC is robust to sampling bias when the two class proportions in a sample deviate from those in the population. This property gives NPC good potential in biomedical research, where sampling bias is common. We demonstrate the use and relative advantages of CC and NPC in simulation and real data studies. Our model-free objective-based ranking idea is extendable to ranking feature subsets and generalizable to other prediction tasks and learning objectives.
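
To make the idea of marginal, prediction-based feature ranking concrete, here is a minimal sketch. It is not the CC or NPC implementation described in the abstract. As a simplified stand-in, it scores each feature by the training error of the best single-threshold rule on that feature alone, and the names `best_threshold_error` and `rank_features` are hypothetical.

```python
def best_threshold_error(values, labels):
    """Smallest training misclassification rate achievable by a rule of the
    form 'predict class 1 if x > t' or its mirror image, using one feature."""
    n = len(values)
    candidates = sorted(set(values))
    # Include a threshold below every value so 'always predict 1' is covered.
    thresholds = [candidates[0] - 1.0] + candidates
    best = 1.0
    for t in thresholds:
        wrong = sum((v > t) != bool(y) for v, y in zip(values, labels))
        # min(...) also considers the mirrored rule 'predict 1 if x <= t'.
        best = min(best, min(wrong, n - wrong) / n)
    return best


def rank_features(features, labels):
    """Return feature names ordered from lowest to highest marginal error."""
    scores = {name: best_threshold_error(vals, labels)
              for name, vals in features.items()}
    return sorted(scores, key=scores.get)


# Toy data (made up for illustration): one separable feature, one noisy one.
toy_features = {
    "informative": [0.1, 0.2, 0.9, 1.1],
    "noise": [0.5, 0.4, 0.5, 0.4],
}
toy_labels = [0, 0, 1, 1]
ranking = rank_features(toy_features, toy_labels)
```

Each feature is scored without reference to the others, which is what "marginal" means here. The criteria in the abstract differ in how the per-feature score is defined, for example by targeting a Neyman-Pearson objective rather than overall error.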