Rapid Classification of NifH Protein Sequences using Classification and Regression Trees
- Author(s): Frank, Ildiko E.
- Advisor(s): Zehr, Jonathan
- et al.
Grouping and classifying nifH gene sequences, molecular proxies for studying nitrogen fixation, are essential steps in diazotroph community analysis, and the increasing size of environmental sequence libraries necessitates a fast and automated solution. We present a novel approach to classify NifH protein sequences into well-defined phylogenetic clusters that provide a common platform for cross-ecosystem comparative analysis. Cluster membership can be accurately predicted with Classification and Regression Trees (CART) statistical models that identify and utilize signature residues in the protein sequences. The decision tree-based classification models were trained and evaluated with the publicly available cluster-annotated nifH gene database and further assessed with model-independent sequence sets from diverse ecosystems. Network graph-based exploration of cluster structures led to models for sequence classification even at finer taxonomic levels. We demonstrate the utility of this sequence binning approach in a comparative study in which joint treatment of diazotroph assemblages from a wide range of habitats identified specialists and generalists and revealed a marine-terrestrial distinction in community composition. Our rapid and automated cluster assignment circumvents extensive analysis of the nifH database and the calculation of phylogenies, and hence saves time and resources in studying nitrogen fixation.
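The idea of predicting cluster membership from signature residues can be illustrated with a minimal decision-tree sketch. This is not the authors' implementation: it is a small CART-style classifier written from scratch, using Gini impurity and binary tests of the form "is the residue at alignment position p equal to r?". The aligned fragments, the cluster labels (cluster_A, cluster_B, cluster_C) and all function names are invented for illustration and do not come from the nifH database.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(seqs, labels):
    """Return (impurity, position, residue) for the binary test
    'seq[position] == residue' with the lowest weighted Gini impurity,
    or None if no test separates the sequences."""
    best = None
    n = len(seqs)
    for pos in range(len(seqs[0])):
        for res in sorted({s[pos] for s in seqs}):
            left = [y for s, y in zip(seqs, labels) if s[pos] == res]
            right = [y for s, y in zip(seqs, labels) if s[pos] != res]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, pos, res)
    return best

def build_tree(seqs, labels):
    """Recursively grow a tree; a leaf is a cluster label (majority vote)."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(seqs, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    _, pos, res = split
    yes = [(s, y) for s, y in zip(seqs, labels) if s[pos] == res]
    no = [(s, y) for s, y in zip(seqs, labels) if s[pos] != res]
    return {
        "pos": pos,
        "res": res,
        "yes": build_tree(*zip(*yes)),
        "no": build_tree(*zip(*no)),
    }

def classify(tree, seq):
    """Walk the tree, testing one signature residue per internal node."""
    while isinstance(tree, dict):
        tree = tree["yes"] if seq[tree["pos"]] == tree["res"] else tree["no"]
    return tree

# Hypothetical aligned fragments and cluster labels (toy data, not real NifH).
TRAIN_SEQS = ["ACDKGE", "ACDKGQ", "SCDRGE", "SCDRGQ", "ATEKGE", "ATEKGQ"]
TRAIN_LABELS = ["cluster_A", "cluster_A", "cluster_B",
                "cluster_B", "cluster_C", "cluster_C"]

tree = build_tree(TRAIN_SEQS, TRAIN_LABELS)
print(classify(tree, "ACDKGD"))  # unseen fragment, classified by two residues
```

In this toy case the tree needs only two residue tests to separate the three clusters, which is the sense in which a handful of signature positions can stand in for a full phylogenetic analysis. A practical model would be trained on a full alignment and would need pruning and held-out evaluation.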