Using emerging data analysis technics to improve pediatric disease diagnosis
- Bao, Bokan
- Advisor(s): Lewis, Nathan E; Courchesne, Eric
Abstract
Many data types are used in bioinformatics research, including genomics, transcriptomics, proteomics, pathway data, disease networks, and gene ontology (GO) data, all of which are heavily studied in disease diagnosis and biomarker detection. The use of newer data types, such as glycomics, fMRI, and facial behavior data, is also growing and can provide unique perspectives on disease cell biology. These new data types have unique properties that require newly adapted algorithms for precise and granular characterization, which is essential before machine learning or statistical models can be confidently used to study disease mechanisms or identify biomarkers from large-scale datasets. The newly developed tools can then allow sophisticated evaluations and yield high-quality results. The first part of my thesis introduced GlyCompare, a powerful glycomics analysis pipeline. The pipeline corrects for the sparsity and non-independence in glycomics data by accounting for the shared biosynthetic network in the data. This new approach makes the downstream analyses more interpretable and better powered. In the second part, a generalizable machine learning platform was developed with 42,840 models composed of 3,570 gene expression feature sets and 12 classification methods. A gene expression classifier for diagnosing autism spectrum disorder (ASD) built with this platform had AUC-ROC ≥ 0.8 on both the training and test sets. Our classifier is diagnostically predictive and replicable across different toddler ages, races, and ethnicities; outperforms the risk gene mutation classifier; and has potential for clinical translation. In the last section, I developed a pipeline to evaluate facial behavior data from toddlers using state-of-the-art expression analysis software. In certain situations, emotional responses are overly intense in toddlers with ASD compared with other toddlers. Our action unit classifier had a sensitivity of 83.3% and a specificity of 67.5% in the test dataset (90.1% and 75% in the training dataset). We verified that our classifier was not biased by common confounding factors (age, race, and ethnicity). By combining the action unit classifier and the Geo-Pref non-social score, we achieved a specificity of 100% and a sensitivity of 50% on the training and test datasets. The ensemble classifier maintained the high specificity while considerably increasing the sensitivity, which provides the potential for screening applications.
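For readers less familiar with the quantities reported above, the following is a minimal Python sketch of how a model grid size and a classifier's sensitivity and specificity are typically computed. It is not the thesis code: the function names `count_models` and `sensitivity_specificity` are hypothetical, and the labels and predictions are made-up toy values, not data from the study.

```python
from itertools import product


def count_models(n_feature_sets, n_classifiers):
    """Count the (feature set, classification method) pairs in a full grid."""
    return sum(1 for _ in product(range(n_feature_sets), range(n_classifiers)))


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = case, 0 = control."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    sensitivity = tp / (tp + fn)  # fraction of true cases that are flagged
    specificity = tn / (tn + fp)  # fraction of controls that are correctly cleared
    return sensitivity, specificity


# Grid size stated in the abstract: 3,570 feature sets x 12 classification methods.
n_models = count_models(3570, 12)

# Toy labels and predictions (illustrative only, not study data).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
toy_sens, toy_spec = sensitivity_specificity(toy_true, toy_pred)

print(n_models)
print(f"sensitivity={toy_sens:.3f}, specificity={toy_spec:.3f}")
```

In a screening setting, specificity is often prioritized so that few unaffected toddlers are flagged, which is why the abstract reports both values rather than a single accuracy figure.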