Chapter 1: The number of known sequences for the nifH gene, commonly used to assess a community's potential for N2 fixation, has grown rapidly over the past few decades. Obtaining these sequences from the National Center for Biotechnology Information's GenBank database is problematic because of annotation errors, nomenclature variation, and paralogues; moreover, GenBank's tools are not conducive to searching solely by function. A software retrieval and curation pipeline called ARBitrator was developed that uses a BLAST search followed by a screening phase based on conserved domain similarity to retrieve nifH sequences from GenBank. The pipeline identified a total of 34,420 nifH sequences in GenBank, with a false-positive rate of 0.033%. The pipeline can be adapted for other genes.
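The two-phase idea (a permissive search, then a stricter screen) can be illustrated with a minimal sketch. This is not ARBitrator's actual scoring scheme: the `Candidate` fields, the `screen` function, the margin rule, and the example scores are all hypothetical, chosen only to show how a screen based on domain similarity might separate true hits from paralogues among search results.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A sequence returned by the initial, permissive search (hypothetical record)."""
    accession: str       # placeholder identifier, not a real GenBank accession
    target_score: float  # assumed similarity to the target gene's conserved domain
    decoy_score: float   # assumed best similarity to any paralogous domain


def screen(candidates, min_margin=0.0):
    """Keep candidates that resemble the target domain more than any paralogous one.

    A candidate passes only if its target score exceeds its decoy score
    by more than min_margin (an assumed, tunable threshold).
    """
    return [c for c in candidates if c.target_score - c.decoy_score > min_margin]


# Made-up scores purely for illustration.
candidates = [
    Candidate("seqA", target_score=180.0, decoy_score=95.0),   # clear target hit
    Candidate("seqB", target_score=60.0, decoy_score=140.0),   # likely paralogue
    Candidate("seqC", target_score=110.0, decoy_score=110.0),  # ambiguous, rejected
]

accepted = [c.accession for c in screen(candidates)]
print(accepted)
```

Rejecting ties, as this sketch does, favors a low false-positive rate at the cost of possibly missing some true sequences; a real pipeline would tune that trade-off against curated data.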
Chapter 2: Marine cyanobacteria capable of fixing molecular nitrogen ("diazotrophs") play a key role in biogeochemical cycling; the nitrogen they fix is a major external source of nitrogen for the open ocean. Candidatus Atelocyanobacterium thalassa (UCYN-A) is a diazotrophic cyanobacterium known for its widespread geographic distribution, unusually reduced genome, and symbiosis with a single-celled prymnesiophyte alga. Recently, a novel strain of this organism, called UCYN-A2, was detected in coastal waters off southern California. We assembled and analyzed the metagenome of this UCYN-A2 population. UCYN-A2 and the open-ocean UCYN-A1 strain share most protein-coding genes with high synteny, yet the average amino-acid sequence identity between orthologous genes is only 86%. Our results suggest that UCYN-A1 and UCYN-A2 had a common ancestor and diverged after genome reduction.
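As an illustration of what an average amino-acid identity figure summarizes, the sketch below computes percent identity for each pair of aligned orthologues and then takes the mean. It is an assumption-laden example rather than the procedure used in this work: the function names, the choice to skip gapped columns, and the toy alignments are all hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Assumption: columns containing a gap ('-') in either sequence are ignored.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def mean_identity(alignments):
    """Unweighted mean of per-pair percent identities across orthologous pairs."""
    values = [percent_identity(a, b) for a, b in alignments]
    return sum(values) / len(values)


# Toy pairwise alignments, not real UCYN-A sequences.
orthologue_alignments = [
    ("MKTAY", "MKSAY"),  # 4 of 5 columns identical
    ("MK-LV", "MKALV"),  # gapped column skipped, 4 of 4 identical
]

average = mean_identity(orthologue_alignments)
print(f"{average:.1f}%")
```

An unweighted mean treats every gene equally regardless of length; weighting by alignment length is a common alternative and would give a different figure for the same data.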
Chapter 3: Gene expression in cells fluctuates over time in response to internal and external stimuli. Cyanobacteria, whose metabolism is tightly coupled to the sunlight cycle, have evolved complex patterns of gene expression that may derive from optimizing growth by coordinating photosynthesis and protein synthesis. These patterns can provide much information on how microorganisms grow and respond to the environment. However, analyzing complex information from whole-genome expression studies is difficult, requiring computational approaches that support visual exploration and analysis of data in order to detect related expression signatures.
A Java application called Dexter was developed to facilitate analysis of single or multiple gene expression time-series data sets. The value of the program was demonstrated by using it to improve operon predictions