Metagenomic Binning Algorithms
- Author(s): Gao, Chen
- Advisor(s): Cui, Xinping
- et al.
Metagenomics is the study of the DNA of microorganisms taken directly from environmental samples, without cultivation or isolation. The emerging field of metagenome sequencing, facilitated by the high-throughput capability of next-generation sequencing (NGS) technology, allows all genomes in an environmental sample to be sequenced simultaneously, but it also produces highly complex datasets. Although NGS technology has significantly improved sequencing efficiency and reduced cost, assembling metagenomic sequences into genomes remains extremely difficult because the reads are very short and are sampled from multiple genomes. Several computational methods have been developed to group metagenomic sequence reads into bins; they fall into two classes: supervised methods and unsupervised methods. Supervised methods may leave a large fraction of reads unclassified because few of the genomes in a sample have known references in the database, while unsupervised methods are still under active development. The performance of existing unsupervised methods relies heavily on read length, the number of species in the sample, and the evenness of species abundance. Some algorithms also cannot operate without a pre-specified number of species, which is not a trivial assumption to make.
In this work, we present a novel algorithm, DirichletCluster, based on a Markovian assumption and the sequential Monte Carlo (SMC) technique. It has shown high binning accuracy under various scenarios and uses a data-driven approach to estimate the number of species systematically. Specifically, we exploit the Markovian structure of the nucleotide reads and implement a mixture Dirichlet process model with a Markov chain structure. The Dirichlet process is a stochastic process that describes a distribution over probability measures, so draws from it can be interpreted as random distributions. With the mixture Dirichlet process model, we are able to characterize individual genome sequences as well as clusters of sequences. Sequential Monte Carlo, together with GC-content ordering, is used to cluster reads into species through a simulation-based approach. We show through simulation studies and a real-data application that the proposed DirichletCluster binning algorithm is robust to the evenness of the abundance ratio and correctly identifies the largest number of species from the metagenomic data among the alternatives considered. Moreover, it uses a completely data-driven approach to estimate the total number of species in the metagenomic sample. We therefore believe that DirichletCluster is a performant binning algorithm that will benefit the advancement of metagenomics research.
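To make the ingredients named above more concrete, the following Python sketch combines first-order Markov transition counts, a Dirichlet prior on each transition row, a Chinese-restaurant-process style prior over clusters, and GC-content ordering. It is not the DirichletCluster implementation: it replaces sequential Monte Carlo with a single greedy pass (in effect one particle that always takes the most probable assignment), and every name and parameter in it (`transition_counts`, `log_predictive`, `greedy_crp_binning`, `alpha`, `beta`) is a hypothetical choice made only for illustration.

```python
import math

ALPHABET = "ACGT"
IDX = {base: i for i, base in enumerate(ALPHABET)}


def transition_counts(read):
    """Return a 4x4 matrix of first-order transition counts for one read."""
    counts = [[0] * 4 for _ in range(4)]
    for a, b in zip(read, read[1:]):
        if a in IDX and b in IDX:  # skip ambiguous bases such as N
            counts[IDX[a]][IDX[b]] += 1
    return counts


def gc_content(read):
    """Fraction of G or C bases in the read."""
    return sum(base in "GC" for base in read) / max(len(read), 1)


def log_predictive(read_counts, cluster_counts, beta=1.0):
    """Log probability of a read's transitions given a cluster's counts.

    Each row of the transition matrix gets a symmetric Dirichlet(beta)
    prior (an assumption of this sketch), which is integrated out.
    """
    total = 0.0
    for i in range(4):
        prior = [cluster_counts[i][j] + beta for j in range(4)]
        n = read_counts[i]
        total += math.lgamma(sum(prior)) - math.lgamma(sum(prior) + sum(n))
        for j in range(4):
            total += math.lgamma(prior[j] + n[j]) - math.lgamma(prior[j])
    return total


def greedy_crp_binning(reads, alpha=1.0, beta=1.0):
    """Assign reads to clusters in one greedy pass, in GC-content order.

    An existing cluster is weighted by its size and a new cluster by
    alpha, as in a Chinese restaurant process. The number of clusters
    is therefore not fixed in advance.
    """
    order = sorted(range(len(reads)), key=lambda k: gc_content(reads[k]))
    empty = [[0] * 4 for _ in range(4)]
    clusters = []  # each entry is [transition count matrix, number of reads]
    labels = [None] * len(reads)
    for k in order:
        rc = transition_counts(reads[k])
        scores = [
            math.log(size) + log_predictive(rc, counts, beta)
            for counts, size in clusters
        ]
        scores.append(math.log(alpha) + log_predictive(rc, empty, beta))
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == len(clusters):  # open a new cluster
            clusters.append([[[0] * 4 for _ in range(4)], 0])
        counts = clusters[best][0]
        for i in range(4):
            for j in range(4):
                counts[i][j] += rc[i][j]
        clusters[best][1] += 1
        labels[k] = best
    return labels
```

Calling `greedy_crp_binning(reads)` on a list of nucleotide strings returns one integer label per read. A particle-based SMC scheme would instead sample the assignment from the normalized scores and maintain several weighted hypotheses, which makes the result less sensitive to early mistakes than this greedy pass.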