Building bioinformatic tools for massive repurposing of multi-omic data in the Sequence Read Archive
Research groups have collectively generated millions of sequencing experiments. Studies often provide a pre-processed -omic matrix, either as a supplementary table in the manuscript or as a deposit in a public data repository. Such pre-processed -omic data are essential for the research community to derive -omic-based statistical and machine learning predictive models, which can be useful for understanding biological features. While analysis-ready pre-processed -omic data reduce the need for high-performance computing and data-processing expertise, the lack of consistency in the submitted pre-processed data and their associated metadata poses a challenge for secondary data analysis.
To tackle this challenge, the first chapter of the thesis evaluates the possibility of reprocessing and extracting allelic read counts from over 250,000 sequencing experiments and 10,000 public studies in the Sequence Read Archive (SRA); such counts are useful for identifying novel variant associations and evaluating allele-specific expression. The second chapter assesses the possibility of constructing an online platform on which the research community can go smoothly from data querying to analysis without any programming background. The third chapter tackles the metadata aspect of the SRA by evaluating the possibility of recognizing the biomedical entities in the metadata without expert curation. By repurposing the vast amount of submitter-provided biospecimen annotations, we can train a deep-learning-based model to evaluate the relationships between various annotations. In summary, this dissertation lays the foundation for automating the association of -omic data with metadata.