Metagenomics is revolutionizing microbial ecology and has unlocked unprecedented opportunities in many domains of the life sciences. For instance, metagenomics has allowed the discovery of new forms of life in unexplored habitats (e.g., in the marine environment). In medicine, metagenomics is allowing doctors to diagnose and treat patients who have diseases related to imbalances in their microbial communities (e.g., the gastrointestinal microbiota). In public health, metagenomics is becoming an invaluable instrument for pathogen surveillance and for monitoring outbreaks in epidemic areas.
As sequencing technologies have considerably improved in speed and cost over the past decade, the number of reference sequences in public databases has grown exponentially. As a consequence, faster, more accurate and more efficient computational methods are needed to analyze these large datasets. The research presented in this dissertation focuses on (i) how to build faster, more accurate and more efficient sequence classification methods to determine the microbial composition of metagenomic samples and (ii) how to infer and recover the microbial composition of a sample in a large network of connected samples (e.g., in the context of city-scale biosurveillance).
Our classification system consists of a family of tools, namely CLARK, CLARK-l and CLARK-S, which are currently used by several research teams worldwide for metagenomic and genomic analysis. While CLARK performs sequence classification with high accuracy and unprecedented speed, CLARK-S achieves the same precision as CLARK and a much higher accuracy, at the cost of a slightly slower speed.