Open Access Publications from the University of California

UC Santa Cruz

UC Santa Cruz Electronic Theses and Dissertations

Rapid Classification of NifH Protein Sequences using Classification and Regression Trees


Grouping and classifying nifH gene sequences, molecular proxies for studying nitrogen fixation, are essential steps in diazotroph community analysis, and the increasing size of environmental sequence libraries necessitates a fast and automated solution. We present a novel approach to classify NifH protein sequences into well-defined phylogenetic clusters that provide a common platform for cross-ecosystem comparative analysis. Cluster membership can be accurately predicted with Classification and Regression Trees (CART) statistical models that identify and utilize signature residues in the protein sequences. The decision tree-based classification models were trained and evaluated with the publicly available cluster-annotated nifH gene database and further assessed with model-independent sequence sets from diverse ecosystems. Network graph-based exploration of cluster structures led to models for sequence classification even at finer taxonomic levels. We demonstrate the utility of this novel sequence binning approach in a comparative study in which joint treatment of diazotroph assemblages from a wide range of habitats identified specialists and generalists and revealed a marine-terrestrial distinction in community composition. Our rapid and automated cluster assignment circumvents extensive analysis of the nifH database and the calculation of phylogenies, and hence saves time and resources in studying nitrogen fixation.
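As a rough illustration of how a CART-style model can pick out signature residues, the sketch below grows a small decision tree over aligned protein fragments, where each test asks whether a given alignment position holds a given residue. It is a hand-rolled, minimal sketch and not the models described here: the six-residue fragments, the cluster labels, and the function names (`gini`, `best_split`, `grow`, `classify`) are all hypothetical, and the Gini impurity criterion is assumed as the splitting rule.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(seqs, labels):
    """Return the (position, residue) test with the lowest weighted Gini
    impurity, or None if no test improves on the unsplit node."""
    best = None
    best_score = gini(labels)
    n = len(seqs)
    for pos in range(len(seqs[0])):
        for res in sorted({s[pos] for s in seqs}):
            left = [l for s, l in zip(seqs, labels) if s[pos] == res]
            right = [l for s, l in zip(seqs, labels) if s[pos] != res]
            if not left or not right:
                continue  # test does not separate anything
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (pos, res), score
    return best


def grow(seqs, labels):
    """Recursively grow a tree. Leaves are cluster labels; internal nodes are
    tuples (position, residue, subtree_if_match, subtree_if_no_match)."""
    split = best_split(seqs, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    pos, res = split
    yes = [(s, l) for s, l in zip(seqs, labels) if s[pos] == res]
    no = [(s, l) for s, l in zip(seqs, labels) if s[pos] != res]
    return (pos, res, grow(*zip(*yes)), grow(*zip(*no)))


def classify(tree, seq):
    """Walk the tree until a leaf (a cluster label) is reached."""
    while isinstance(tree, tuple):
        pos, res, yes, no = tree
        tree = yes if seq[pos] == res else no
    return tree


# Hypothetical aligned fragments (made up for illustration, not real NifH data).
train_seqs = ["AKCGES", "AKCGDS", "SKCGES",
              "AKLGES", "AKLGDT", "SKLGET",
              "AKMAES", "SKMADS", "AKMAET"]
train_labels = ["I", "I", "I", "II", "II", "II", "III", "III", "III"]

tree = grow(train_seqs, train_labels)
print(tree)
print(classify(tree, "SKMGDS"))
```

In this toy set the residue at alignment index 2 alone separates the three clusters, so the tree tests only that position. On real alignments many positions would compete, and pruning or cross-validation would be needed to keep the tree from overfitting.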
