eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Mining Biological Data with Machine Learning to Obtain New Insights

No data is associated with this publication.
Abstract

Molecular biology is difficult to study because cells and organisms are highly complex systems. To decode them, recent efforts have focused on generating data at an increasingly large scale. Fortunately, two major approaches in bioinformatics, network analysis and machine learning, provide tools to digest these data into biological insights. During my PhD, I improved and developed methods in network analysis and machine learning to make them more efficient and to generate human-understandable insights. In Chapter I, a colleague and I developed a new version of the information flow analysis algorithm with a time complexity of O(n²m) instead of O(n⁴). We also developed a GPU version of this algorithm that achieves a 17.4x speedup over the original implementation on a Hi-C network of a relatively small chromosome. This allows information flow analysis to be applied to much larger biological networks. In Chapter II, I developed contextual regression, a deep neural network framework that achieves state-of-the-art prediction accuracy while generating human-interpretable outputs. We tested this method on a simulated biological signal dataset and found that it not only predicts well but also recovers the ground-truth model, even at a noise level of 80%. To test the method on real-world data, we applied it to analyze the effect of histone marks on open chromatin. The model not only outperforms previous models in prediction but also uncovers histone mark patterns that are associated with open chromatin formation. In Chapter III, a colleague and I applied the contextual regression framework to circular RNA data. Combining this new method with the circNet database, we uncovered 7 types of circular RNA biogenesis mechanisms. This discovery supports the hypothesis that multiple biogenesis mechanisms co-exist for different subsets of human circRNAs. In Chapter IV, I developed a new contextual regression architecture and applied it to the task of predicting RNA expression levels from gene sequences. The new architecture not only achieved state-of-the-art accuracy but also learned important motifs that can affect gene expression. I then developed a selection process and generated a linear formula of 300 sequence motifs that predicts expression levels as accurately as deep neural network models. The most influential of these motifs are deeply involved in development, regulation, and stimulus response. When we applied this linear formula to different cell lines, we found that its parameters are associated with biological properties such as epigenetic influence through development, cell lineage, and differences in expression patterns among cell lines. We anticipate that the tools created in these studies will help address the challenge of mining biological insights from the growing volume of biological data.

Main Content

This item is under embargo until January 5, 2025.