Computational tools for the analysis of high-throughput genome-scale sequence data
As high-throughput sequence data becomes increasingly used in a variety of fields, there is a growing need for computational tools that facilitate the analysis and interpretation of this data to extract biological meaning. To date, several computational tools have been developed to analyze raw and processed sequence data in a number of contexts. However, many of these tools focus primarily on well-studied reference organisms, and in some cases, such as the visualization of molecular signatures in expression data, tools are scarce or entirely absent. Furthermore, the compendium of genome-scale data in publicly accessible databases can be leveraged to inform new studies. The focus of this dissertation is the development of computational tools and methods to analyze high-throughput genome-scale sequence data, as well as their application in mammalian, algal, and bacterial systems. Chapter 1 introduces the challenges of analyzing high-throughput sequence data. Chapter 2 presents the Signature Visualization Tool (SaVanT), a framework to visualize molecular signatures in user-generated expression data on a sample-by-sample basis. This chapter demonstrates that SaVanT can use immune activation signatures to distinguish patients with different types of acute infections (influenza A and bacterial pneumonia), and to determine the primary cell types underlying different leukemias (acute myeloid and acute lymphoblastic) and skin disorders. Chapter 3 describes the Algal Functional Annotation Tool, which biologically interprets large gene lists, such as those derived from differential expression experiments. This tool integrates data from several pathway, ontology, and protein domain databases and performs enrichment testing on gene lists for several algal genomes. Chapter 4 describes a survey of the Chlamydomonas reinhardtii transcriptome and methylome across various stages of its sexual life cycle. This chapter discusses the identification and function of 361 gamete-specific and 627 zygote-specific genes, the first base-resolution methylation map of C. reinhardtii, and the changes in chloroplast methylation throughout key stages of its life cycle. Chapter 5 presents a comparative genomics approach to identifying previously uncharacterized bacterial microcompartment (BMC) proteins. Based on the genomic proximity of genes in 131 fully sequenced bacterial genomes, this chapter describes new putative microcompartments and their function.
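The enrichment testing mentioned for Chapter 3 can be illustrated with a minimal sketch. It assumes a one-sided hypergeometric test, which is a common choice for gene-list enrichment; the abstract does not state which test the Algal Functional Annotation Tool uses, and the function name, parameter names, and counts below are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population_size, annotated_in_population,
                                list_size, annotated_in_list):
    """One-sided hypergeometric p-value for over-representation.

    Returns P(X >= annotated_in_list), where X is the number of annotated
    genes seen when drawing list_size genes without replacement from a
    genome of population_size genes, annotated_in_population of which
    carry the annotation (e.g. a pathway or ontology term).
    All names here are illustrative assumptions, not the tool's API.
    """
    # The overlap cannot exceed either the list size or the annotated set.
    upper = min(annotated_in_population, list_size)
    total = comb(population_size, list_size)
    tail = sum(
        comb(annotated_in_population, k)
        * comb(population_size - annotated_in_population, list_size - k)
        for k in range(annotated_in_list, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 10 genes in the genome, 3 annotated to a pathway,
    # a list of 4 genes of which 3 are annotated.
    p = hypergeom_enrichment_pvalue(10, 3, 4, 3)
    print(f"enrichment p-value: {p:.4f}")
```

In practice such a p-value would be computed for every annotation term and then corrected for multiple testing; that step is omitted here.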