Scholarly Works (46 results)

Article
Peer Reviewed

StochHMM: a flexible hidden Markov model tool and C++ library

UC Davis Previously Published Works (2014)

Hidden Markov models (HMMs) are probabilistic models that are well suited to solving many different classification problems in computational biology. StochHMM provides a command-line program and C++ library that can implement a traditional HMM from a simple text file. StochHMM gives researchers the flexibility to create higher-order emissions and to integrate additional data sources and/or user-defined functions into multiple points within the HMM framework. Additional features include user-defined alphabets, the ability to handle ambiguous characters in an emission-dependent manner, user-defined weighting of state paths and the ability to tie transition probabilities to sequence.

Availability and implementation

StochHMM is implemented in C++ and is available under the MIT License. Software, source code, documentation and examples can be found at http://github.com/KorfLab/StochHMM.
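
As an illustration of what decoding with a traditional HMM involves, the following is a minimal Python sketch of the Viterbi algorithm. It is not StochHMM's API or model file format; the function name, the dictionary layout and the two-state toy model (AT-rich versus GC-rich) are assumptions made for this example, and all probabilities are assumed to be strictly positive so that logarithms are defined.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observed symbols.

    Works in log space to avoid underflow on long sequences.
    Assumes every probability is > 0 (hypothetical toy setting).
    """
    # Scores for the first symbol: start probability times emission.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    backpointers = []

    for symbol in obs[1:]:
        current, pointers = {}, {}
        for s in states:
            # Best previous state to transition from into state s.
            best_prev = max(
                states,
                key=lambda p: scores[-1][p] + math.log(trans_p[p][s]),
            )
            current[s] = (scores[-1][best_prev]
                          + math.log(trans_p[best_prev][s])
                          + math.log(emit_p[s][symbol]))
            pointers[s] = best_prev
        scores.append(current)
        backpointers.append(pointers)

    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical two-state model: AT-rich versus GC-rich sequence.
STATES = ["AT", "GC"]
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {
    "AT": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}

decoded = viterbi("AAGGGG", STATES, START, TRANS, EMIT)
```

For the toy sequence, `decoded` labels the two leading A's as AT-rich and the remaining G's as GC-rich.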

Article
Peer Reviewed

Longer First Introns Are a General Property of Eukaryotic Gene Structure

UC Davis Previously Published Works (2008)

While many properties of eukaryotic gene structure are well characterized, differences in the form and function of introns that occur at different positions within a transcript are less well understood. In particular, the dynamics of intron length variation with respect to intron position has received relatively little attention. This study analyzes all available data on intron lengths in GenBank and finds a significant trend of increased length in first introns throughout a wide range of species. This trend was found to be even stronger when using high-confidence gene annotation data for three model organisms (Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster), which show that the first intron in the 5' UTR is, on average, significantly longer than all downstream introns within a gene. A partial explanation for increased first intron length in A. thaliana is suggested by the increased frequency of certain motifs that are present in first introns. The phenomenon of longer first introns can potentially be used to improve gene prediction software and also to detect errors in existing gene annotations.

Article
Peer Reviewed

Bind-n-Seq: high-throughput analysis of in vitro protein–DNA interactions using massively parallel sequencing

UC Davis Previously Published Works (2009)

Transcription factor-DNA interactions are some of the most important processes in biology because they directly control hereditary information. The targets of most transcription factors are unknown. In this report, we introduce Bind-n-Seq, a new high-throughput method for analyzing protein-DNA interactions in vitro, with several advantages over current methods. The procedure has three steps: (i) binding proteins to randomized oligonucleotide DNA targets, (ii) sequencing the bound oligonucleotides with massively parallel technology and (iii) finding motifs among the sequences. De novo binding motifs determined by this method for the DNA-binding domains of two well-characterized zinc-finger proteins were similar to those described previously. Furthermore, calculations of the relative affinity of the proteins for specific DNA sequences correlated significantly with previous studies (R² = 0.9). These results present Bind-n-Seq as a highly rapid and parallel method for determining in vitro binding sites and relative affinities.

Thesis
Peer Reviewed

From proteins, to machines, to protons, to genes, and back again

Fraga, Keith Jeffrey
Advisor(s): Korf, Ian F

UC Davis Electronic Theses and Dissertations (2022)

The success of data standards and public databases in biology is the foundation for the current and continued success of machine learning in biology and medicine. This dissertation explores the interactions between biology, computers, and people in order to develop novel machine learning methods to model complex biological problems. Data is one of the main resources for machine learning, and Chapters 1, 2, and 3 are explicitly about data organization and quality assurance in the protein Nuclear Magnetic Resonance (NMR) spectroscopy discipline. Chapters 4 and 5 present new machine learning architectures to address learning tasks in genomic site recognition and NMR chemical shift prediction. Chapter 1 investigates the manner in which protein NMR chemical shift data is deposited at the Biological Magnetic Resonance Bank (BMRB) in order to build simple table look-up models to estimate protein chemical shifts. In Chapter 1, we find that the BMRB has low sequence diversity and data redundancy that were challenging to locate and filter out. Without filtering out BMRB entries with the same sequence, and possibly the same chemical shifts, look-up models will appear more accurate because of data contamination between training and testing sets. Chapter 2 examines approaches to curate a large protein sample production and NMR database to create an NMR time-domain dataset. Quality assurance tests in this NMR sample/FID database uncovered data collisions and redundancies among the database records, which motivated the development of new NMR database management tools. Chapter 3 presents SpecDB, a relational database schema to archive protein NMR samples and associated time-domain data. SpecDB is open source and available at https://github.rpi.edu/RPIBioinformatics/SpecDB.git. Chapter 4 explores how deep neural networks can recognize genomic splice acceptor and donor sites from sequence alone, achieving 97% accuracy for highly used splice donor sites. Chapter 4 also investigates neural networks for intron/exon sequence classification, maximally reaching 77% accuracy. Chapter 5 presents the application of marginalized graph kernels to the prediction of NMR chemical shifts for small organic molecules. Incorporating chemical descriptors into the graph kernels yields a 3.501 ppm mean absolute error for carbon chemical shifts. In total, the following five dissertation chapters explore work in data integrity, organization, and learning techniques from data for applications to structural biology problems.

Article
Peer Reviewed

SAMSA: a comprehensive metatranscriptome analysis pipeline

UC Davis Previously Published Works (2016)

Background

Although metatranscriptomics, the study of diverse microbial population activity based on RNA-seq data, is rapidly growing in popularity, there are limited options for biologists to analyze this type of data. Current approaches for processing metatranscriptomes rely on restricted databases and a dedicated computing cluster, or metagenome-based approaches that have not been fully evaluated for processing metatranscriptomic datasets. We created a new bioinformatics pipeline, designed specifically for metatranscriptome dataset analysis, which runs in conjunction with Metagenome-RAST (MG-RAST) servers. Designed for use by researchers with relatively little bioinformatics experience, SAMSA offers a breakdown of metatranscriptome transcription activity levels by organism or transcript function, and is fully open source. We used this new tool to evaluate best practices for sequencing stool metatranscriptomes.

Results

Working with the MG-RAST annotation server, we constructed the Simple Annotation of Metatranscriptomes by Sequence Analysis (SAMSA) software package, a complete pipeline for the analysis of gut microbiome data. SAMSA can summarize and evaluate raw annotation results, identifying abundant species and significant functional differences between metatranscriptomes. Using pilot data and simulated subsets, we determined experimental requirements for fecal gut metatranscriptomes. Sequences need to be either long reads (longer than 100 bp) or joined paired-end reads. Each sample needs 40-50 million raw sequences, which can be expected to yield the 5-10 million annotated reads necessary for accurate abundance measures. We also demonstrated that ribosomal RNA depletion does not equally deplete ribosomes from all species within a sample, and remaining rRNA sequences should be discarded. Using publicly available metatranscriptome data in which rRNA was not depleted, we were able to demonstrate that overall organism transcriptional activity can be measured using mRNA counts. We were also able to detect significant differences between control and experimental groups in both organism transcriptional activity and specific cellular functions.

Conclusions

By making this new pipeline publicly available, we have created a powerful new tool for metatranscriptomics research, offering a new method for greater insight into the activity of diverse microbial communities. We further recommend that stool metatranscriptomes be ribodepleted and sequenced in a 100 bp paired-end format with a minimum of 40 million reads per sample.

Article
Peer Reviewed

Assessing the gene space in draft genomes

UC Davis Previously Published Works (2009)

Genome sequencing projects have been initiated for a wide range of eukaryotes. A few projects have reached completion, but most exist as draft assemblies. As one of the main reasons to sequence a genome is to obtain its catalog of genes, an important question is how complete or completable the catalog is in unfinished genomes. To answer this question, we have identified a set of core eukaryotic genes (CEGs), that are extremely highly conserved and which we believe are present in low copy numbers in higher eukaryotes. From an analysis of a phylogenetically diverse set of eukaryotic genome assemblies, we found that the proportion of CEGs mapped in draft genomes provides a useful metric for describing the gene space, and complements the commonly used N50 length and x-fold coverage values.
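
The two kinds of metric mentioned above can be illustrated with a short Python sketch. N50 is computed in its standard sense: the contig length at which the longest contigs, added in descending order, first cover at least half of the total assembly length. The CEG proportion is shown simply as the fraction of a core gene set found in an assembly; the function names and the example gene identifiers are hypothetical, and the abstract does not describe how the authors map CEGs to a genome.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Stop once the longest contigs cover at least half the assembly.
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


def ceg_fraction(genes_found, core_genes):
    """Fraction of the core gene set that was found in the assembly."""
    core = set(core_genes)
    if not core:
        raise ValueError("core_genes must not be empty")
    return len(core & set(genes_found)) / len(core)


# Hypothetical example values.
example_n50 = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])
example_fraction = ceg_fraction({"ceg1", "ceg2", "other"},
                                {"ceg1", "ceg2", "ceg3", "ceg4"})
```

The two numbers answer different questions: N50 describes contiguity, while the core gene fraction describes how much of the expected gene content is represented, which is why the abstract presents them as complementary.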

Article
Peer Reviewed

GC skew is a conserved property of unmethylated CpG island promoters across vertebrates

UC Davis Previously Published Works (2015)

GC skew is a measure of the strand asymmetry in the distribution of guanines and cytosines. GC skew favors R-loops, a type of three-stranded nucleic acid structure that forms upon annealing of an RNA strand to one strand of DNA, creating a persistent RNA:DNA hybrid. Previous studies show that GC skew is prevalent at thousands of human CpG island (CGI) promoters and transcription termination regions, which correspond to hotspots of R-loop formation. Here, we investigated the conservation of GC skew patterns in 60 sequenced chordate genomes. We report that GC skew is a conserved sequence characteristic of the CGI promoter class in vertebrates. Furthermore, we reveal that promoter GC skew peaks at the exon 1/intron 1 junction and that it is highly correlated with gene age and CGI promoter strength. Our data also show that GC skew is predictive of unmethylated CGI promoters in a range of vertebrate species and that it imparts significant DNA hypomethylation for promoters with intermediate CpG densities. Finally, we observed that terminal GC skew is conserved for a subset of vertebrate genes that tend to be located significantly closer to their downstream neighbors, consistent with a role for R-loop formation in transcription termination.
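
For readers unfamiliar with the measure, GC skew on one strand is conventionally computed as (G − C) / (G + C). The Python sketch below applies that formula to a whole sequence and to sliding windows. The function names, window size and step are assumptions chosen for illustration; the abstract does not state the parameters the authors used.

```python
def gc_skew(seq):
    """Return (G - C) / (G + C) for a sequence, or 0.0 if it has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


def sliding_gc_skew(seq, window, step):
    """GC skew in consecutive windows of a fixed size along the sequence."""
    return [
        gc_skew(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


# Hypothetical toy sequence: a G-rich stretch followed by a C-rich stretch.
profile = sliding_gc_skew("GGGGCCCC", window=4, step=4)
```

Positive values indicate an excess of G over C on the strand being read, and negative values indicate the reverse.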

Article
Peer Reviewed

Comparative and functional analysis of intron-mediated enhancement signals reveals conserved features among plants

UC Davis Previously Published Works (2011)

Introns in a wide range of organisms including plants, animals and fungi are able to increase the expression of the genes that contain them. This process of intron-mediated enhancement (IME) is most thoroughly studied in Arabidopsis thaliana, where it has been shown that enhancing introns are typically located near the promoter and are compositionally distinct from downstream introns. In this study, we perform a comprehensive comparative analysis of several sequenced plant genomes. We find that enhancing sequences are conserved in the multi-cellular plants but are either absent or unrecognizable in algae. IME signals are preferentially located towards the 5'-end of first introns but also appear to be enriched in 5'-UTRs and coding regions near the transcription start site. Enhancing introns are found most prominently in genes that are highly expressed in a wide range of tissues. Through site-directed mutagenesis in A. thaliana, we show that IME signals can be inserted or removed from introns to increase or decrease gene expression. Although we do not yet know the specific mechanism of IME, the predicted signals appear to be both functional and highly conserved.

Article
Peer Reviewed

Evidence for a DNA-Based Mechanism of Intron-Mediated Enhancement

UC Davis Previously Published Works (2011)

Many introns significantly increase gene expression through a process termed intron-mediated enhancement (IME). Introns exist in the transcribed DNA and the nascent RNA, and could affect expression from either location. To determine which is more relevant to IME, hybrid introns were constructed that contain sequences from stimulating Arabidopsis thaliana introns either in their normal orientation or as the reverse complement. Both ends of each intron are from the non-stimulatory COR15a intron in their normal orientation to allow splicing. The inversions create major alterations to the sequence of the transcribed RNA with relatively minor changes to the DNA structure. Introns containing portions of either the UBQ10 or ATPK1 intron increased expression to a similar degree regardless of orientation. Also, computational predictions of IME improve when both intron strands are considered. These findings are more consistent with models of IME that act at the level of DNA rather than RNA.

Article
Peer Reviewed

The collaborative effect of scientific meetings: A study of the International Milk Genomics Consortium.

UC Davis Previously Published Works (2018)

Collaboration among scientists has a major influence on scientific progress. Such collaboration often results from scientific meetings, where scientists gather to present and discuss their research and to meet potential collaborators. However, most scientific meetings have inherent biases, such as the availability of research funding or the selection bias of professional societies that make it difficult to study the effect of the meeting per se on scientific productivity. To evaluate the effects of scientific meetings on collaboration and progress independent of these biases, we conducted a study of the annual symposia held by the International Milk Genomics Consortium (IMGC) over a 12-year period. In our study, we conducted permutation testing to analyze the effectiveness of the IMGC in facilitating collaboration and productivity in a community of milk scientists who were meeting attendees relative to non-attendees. Using the number of co-authorships on published papers as a measure of collaboration, our analysis revealed that scientists who attended the symposium were associated with more collaboration than were scientists who did not attend. Furthermore, we evaluated the scientific progress of consortium attendees by analyzing publication rate and article impact. We found that IMGC attendees, in addition to being more collaborative, were also more productive and influential than were non-attendees who published in the same field. The results of our study suggest that the annual symposium encouraged interactions among disparate scientists and increased research productivity, exemplifying the positive effect of scientific meetings on both collaboration and progress.
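
The general idea of a permutation test, as used above to compare attendees with non-attendees, can be sketched in Python as follows. This is a generic one-sided test on the difference in group means, not the authors' analysis; the test statistic, the number of permutations, the function name and the co-authorship counts are all assumptions made for this example.

```python
import random


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """One-sided permutation test on the difference in means (a minus b).

    Returns the observed difference and an estimated p-value: the share of
    random relabelings whose difference is at least as large as observed.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)

    pooled = list(group_a) + list(group_b)
    at_least_as_large = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b)
        if diff >= observed:
            at_least_as_large += 1

    # Add-one correction so the estimate is never exactly zero.
    p_value = (at_least_as_large + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical co-authorship counts per scientist.
attendees = [5, 6, 7, 8]
non_attendees = [1, 2, 3, 4]
observed_diff, p = permutation_test(attendees, non_attendees)
```

Because the null distribution is built by relabeling the observed data, the test makes no assumption about the shape of the distribution of co-authorship counts.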