Towards integrated genomics data analyses to facilitate identification of diagnostic biomarkers
- Listopad, Stanislav
- Advisor(s): Norden-Krichmar, Trina M
Abstract
While the total amount of genomic data has increased rapidly over the past decade, most individual biomedical research studies are still limited to small numbers of participant samples because of the high costs of recruitment, sequencing, data storage, and data analysis. The result is many datasets with few samples but a very large number of features across multiple genomic data types. Handling small-sample-size datasets appropriately and integrating multiple genomic data types are essential for identifying actionable diagnostic biomarkers. The overarching goal of my dissertation is to address some of these challenges using software engineering, bioinformatics, and machine learning methods. In this document, I cover the three major projects of my dissertation. First, I describe A-Lister, a software tool that I developed to filter, compare, and combine items across multiple differential expression files in order to facilitate data integration and feature selection. Second, I describe a multiclass machine learning approach that I implemented to classify liver disease and identify gene expression biomarkers using a transcriptomics liver disease dataset. As part of this analysis, I implemented a variety of bioinformatic pipelines, feature selection techniques, and machine learning classifiers to classify small-sample-size RNA-seq data. Third, I describe an integrated model that I created using both transcriptomics and proteomics data to identify a combined gene and protein biomarker panel for classifying liver disease. The tools and methods developed in my dissertation are not specific to liver disease; they are intended for use with any small-sample-size genomics dataset to aid in biomarker discovery.