With the rapid growth of biochemical technologies in recent years, large and diverse datasets have been generated in every branch of biology. These data carry valuable scientific information but pose challenges for modeling, computation, and interpretation. In this work, I present statistical models for two types of bioinformatic data: RNA-Seq alternative splicing and GCMS metabolomics. The R packages grMATS and gcmsDecon are available for download.
Next-generation sequencing produces rich RNA-Seq data in which alternative splicing events can be observed. Replicate multivariate analysis of transcript splicing (rMATS) has shown advantages over other existing methods for detecting differential alternative splicing from replicate RNA-Seq data. However, the current rMATS framework handles only two-isoform splicing events, which limits its use. In this paper, we present a generalized rMATS framework that handles multiple-isoform splicing events; the model can also be extended to compare differential splicing among multiple groups. We provide a generalized likelihood ratio test in which the null hypothesis allows a user-defined threshold of splicing change for the isoforms. We show that the test statistic follows a mixture of chi-square distributions whose coefficients depend on the true parameter values, and that a least favorable test statistic can be computed when the true parameters are unknown. We demonstrate the efficacy of our model on 27+3 simulations and a real dataset. Given the strong demand for methods that handle multiple-isoform RNA-Seq data, we expect the model to be useful in RNA-Seq research projects.
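To make the mixture-of-chi-squares null distribution concrete, the following minimal sketch computes a tail probability for a test statistic given a set of mixing weights. It is an illustration under stated assumptions, not the grMATS implementation: the function names (`chi2_sf`, `mixture_pvalue`, `conservative_pvalue`), the example weights, and the idea of taking the largest p-value over a user-supplied list of candidate weight vectors are all hypothetical stand-ins for how a least favorable configuration might be handled.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df >= 0."""
    if df == 0:
        # A chi-square with 0 degrees of freedom is a point mass at zero.
        return 1.0 if x <= 0 else 0.0
    if x <= 0:
        return 1.0
    # Closed-form base cases: df = 1 (via erfc) or df = 2 (exponential tail).
    if df % 2 == 1:
        k, tail = 1, math.erfc(math.sqrt(x / 2.0))
    else:
        k, tail = 2, math.exp(-x / 2.0)
    # Recurrence: Q(x; k+2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        log_term = (k / 2.0) * math.log(x / 2.0) - x / 2.0 - math.lgamma(k / 2.0 + 1.0)
        tail += math.exp(log_term)
        k += 2
    return tail


def mixture_pvalue(stat, weights):
    """Tail probability under a chi-square mixture.

    weights[k] is the (assumed known) mixing weight on the chi-square
    component with k degrees of freedom; the weights must sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-8:
        raise ValueError("mixing weights must sum to 1")
    return sum(w * chi2_sf(stat, k) for k, w in enumerate(weights))


def conservative_pvalue(stat, candidate_weights):
    """Largest mixture p-value over a hypothetical set of candidate weight vectors."""
    return max(mixture_pvalue(stat, w) for w in candidate_weights)


# Hypothetical example: two candidate weightings over 0, 1, and 2 degrees of freedom.
candidates = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25]]
p_conservative = conservative_pvalue(3.84, candidates)
```

Taking the maximum over candidate weightings yields a p-value that is valid for every weighting in the list, at the cost of some power; how the actual least favorable statistic is derived is left to the full text.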
As the collection of metabolic end products, the metabolome reflects the overall activity of the metabolic network and plays an important role in modern biochemical research. Monitoring metabolites and relating their changes to the influence of other factors is of major scientific interest. Gas Chromatography-Mass Spectrometry (GCMS) produces, from biological samples, metabolomic data in which each metabolite is broken into fragments of different masses (whose relative proportions form a mass spectrum s) that co-elute within a retention time range over which the spectrum is unchanged. This signature data structure enables identification of individual metabolites and allows construction of a library for the whole metabolome. However, GCMS cannot cleanly separate the elutions of different metabolites, which poses a challenging problem of deconvolution and library matching. In addition, metabolome studies usually involve multiple biological samples in order to understand which metabolites are related to disease. Establishing the multiple correspondence across all samples further complicates the task. We propose an automatic rank-based non-negative matrix factorization model that streamlines spectral deconvolution, multiple correspondence, metabolite selection, and library matching. We apply the program to 27 simulated datasets as well as 2 real contrived datasets. In all cases, our model outperforms existing software.
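The deconvolution idea described above can be illustrated with a plain non-negative matrix factorization: a masses-by-time intensity matrix X is approximated as S C, where the columns of S play the role of mass spectra and the rows of C play the role of elution profiles. The sketch below uses standard multiplicative updates with a fixed, user-chosen rank on toy data. It is an assumption-laden illustration only: it does not implement the automatic rank-based model proposed here, and the names (`nmf`, `frob_error`), the toy spectra, and the toy profiles are hypothetical.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frob_error(x, s, c):
    """Squared Frobenius norm of x - s c."""
    approx = matmul(s, c)
    return sum((xi - ai) ** 2 for xr, ar in zip(x, approx) for xi, ai in zip(xr, ar))


def nmf(x, rank, n_iter=200, seed=0, eps=1e-12):
    """Factor x (masses x time) into non-negative s (masses x rank) and c (rank x time)."""
    rng = random.Random(seed)
    m, t = len(x), len(x[0])
    s = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(m)]
    c = [[rng.uniform(0.1, 1.0) for _ in range(t)] for _ in range(rank)]
    for _ in range(n_iter):
        # Update elution profiles: c <- c * (s^T x) / (s^T s c)
        st = transpose(s)
        num, den = matmul(st, x), matmul(matmul(st, s), c)
        c = [[c[i][j] * num[i][j] / (den[i][j] + eps) for j in range(t)] for i in range(rank)]
        # Update spectra: s <- s * (x c^T) / (s c c^T)
        ct = transpose(c)
        num, den = matmul(x, ct), matmul(s, matmul(c, ct))
        s = [[s[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)] for i in range(m)]
    return s, c


# Toy data: two hypothetical metabolites with overlapping elution profiles.
spectra = [[0.6, 0.0], [0.3, 0.1], [0.1, 0.5], [0.0, 0.4]]   # 4 masses x 2 metabolites
profiles = [[0, 1, 3, 2, 1, 0], [0, 0, 1, 3, 2, 1]]           # 2 metabolites x 6 time points
x = matmul(spectra, profiles)

s_hat, c_hat = nmf(x, rank=2)
residual = frob_error(x, s_hat, c_hat)
```

The recovered factors are only identifiable up to scaling and ordering of the components, so in practice each estimated spectrum would be normalized before being compared against library entries.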