Accurate phylogenetic classification of DNA fragments based on sequence composition
Metagenome studies have retrieved vast amounts of sequence out of a variety of environments, leading to novel discoveries and great insights into the uncultured microbial world. Except for very simple communities, diversity makes sequence assembly and analysis a very challenging problem. To understand the structure and function of microbial communities, a taxonomic characterization of the obtained sequence fragments is highly desirable, yet currently limited mostly to those sequences that contain phylogenetic marker genes. We show that for clades at the rank of domain down to genus, sequence composition allows the very accurate phylogenetic characterization of genomic sequence. We developed a composition-based classifier, PhyloPythia, for de novo phylogenetic sequence characterization and have trained it on a data set of 340 genomes. By extensive evaluation experiments we show that the method is accurate across all taxonomic ranks considered, even for sequences that originate from novel organisms and are as short as 1 kb. Application to two metagenome datasets obtained from samples of phosphorus-removing sludge showed that the method allows the accurate classification at genus level of most sequence fragments from the dominant populations, while at the same time correctly characterizing even larger parts of the samples at higher taxonomic levels.
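To illustrate what "sequence composition" can mean in practice, the following is a minimal sketch that turns a DNA fragment into a vector of normalized oligonucleotide (k-mer) frequencies. It is not PhyloPythia's implementation: the abstract does not specify the features or the classifier, so the function name `kmer_composition`, the default k-mer length of 4, and the handling of ambiguous bases are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_composition(sequence, k=4):
    """Return normalized k-mer frequencies for a DNA fragment.

    Hypothetical helper for illustration only; k=4 is an assumed default,
    not a value taken from the abstract.
    """
    alphabet = "ACGT"
    seq = sequence.upper()
    # Enumerate all 4**k possible k-mers so every fragment maps to a
    # vector with the same dimensions, regardless of its content.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Skip windows containing ambiguous bases such as N.
        if all(base in alphabet for base in window):
            counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        # Fragment shorter than k, or no unambiguous window.
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: counts[kmer] / total for kmer in kmers}
```

A vector of this kind, computed for fragments of known origin, could serve as input to any standard supervised classifier; which classifier and which k-mer lengths work well at a given taxonomic rank would have to be established by evaluation of the kind the abstract describes.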