eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Proteogenomics: applications of mass spectrometry at the interface of genomics and proteomics

Abstract

Proteins are understood to be the main workhorses in the cell, participating in a wide variety of activities from cell structure to inter- and intra-cellular transport. Through improvements in sample preparation and instrumentation, mass spectrometry has become a popular, efficient, high-throughput technology for studying protein expression. The standard protocol for a mass spectrometry experiment includes digestion of the sample proteins into peptides, which are subsequently analyzed by the mass spectrometer to produce tandem mass spectra. An important initial step in characterizing the sample is identifying the peptide precursor of each spectrum. This routinely involves comparing the experimental spectrum to the theoretical spectrum associated with a peptide sequence contained in a protein sequence database. Publicly available protein sequence databases are believed to be complete for well-understood model organisms. However, in this thesis we demonstrate that even for organisms that receive extensive attention, the databases are missing a significant fraction of expressed proteins. We describe two situations in which a comprehensive protein sequence database is not available for peptide identification and propose methods for addressing the issue. Determining the nucleotide sequence that comprises an organism's genome is only the first step to understanding the molecular basis for its phenotype. Genome annotation is required to determine the function of each nucleotide, including nucleotides that encode the blueprint for proteins. We present a semi-automated pipeline that accepts mass spectra and the sequenced genome, and addresses the dual goals of annotating the genome for protein-coding genes and identifying peptide sequences in the absence of a complete, curated protein sequence database.
The genome annotation pipeline described above assumes that the genome is immutable. Immunoglobulins, proteins involved in the adaptive immune system, instead require rearrangements of the genome that result in a different immunoglobulin gene sequence in nearly every B cell in the body. We build on the ideas of genome annotation to construct a database that represents the complement of possible immunoglobulin gene sequences in an organism. In addition, we move beyond the goal of peptide identification to sequence entire proteins.
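
The comparison of an experimental spectrum to a theoretical spectrum, as described in the abstract, can be illustrated with a minimal Python sketch. It is not the method used in the thesis: the function names, the restriction to singly charged b and y ions, the shared peak count score, and the 0.5 m/z tolerance are all assumptions chosen for illustration. The residue masses are standard monoisotopic values.

```python
# Illustrative sketch only: names, scoring, and tolerance are assumptions,
# not taken from the thesis.

# Standard monoisotopic amino acid residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O
PROTON = 1.007276   # mass of a proton


def theoretical_spectrum(peptide):
    """Return sorted m/z values of singly charged b and y ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    prefix = 0.0
    for m in masses[:-1]:            # b ions: N-terminal prefixes
        prefix += m
        ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):   # y ions: C-terminal suffixes
        suffix += m
        ions.append(suffix + WATER + PROTON)
    return sorted(ions)


def shared_peak_count(experimental_mz, peptide, tolerance=0.5):
    """Count theoretical ions with an experimental peak within tolerance."""
    return sum(
        any(abs(t - e) <= tolerance for e in experimental_mz)
        for t in theoretical_spectrum(peptide)
    )


def best_match(experimental_mz, candidates, tolerance=0.5):
    """Return the candidate peptide with the highest shared peak count."""
    return max(
        candidates,
        key=lambda p: shared_peak_count(experimental_mz, p, tolerance),
    )
```

In this sketch, `best_match(peaks, ["PEPTIDE", "GAVLK"])` scores each candidate against the peak list `peaks` and returns the higher-scoring one. A real search engine would also filter candidates by precursor mass and use peak intensities, which this sketch omits. The candidates can only come from the database being searched, which is why an incomplete database leaves spectra unidentified.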
