High-throughput data have become ubiquitous in the study of biological phenomena. We can now query cellular state at higher resolution, giving us better insight into complex diseases.
For example, tens of thousands of cancer patients now have simultaneous copy number, mutation, methylation, mRNA, miRNA and protein-level profiles.
Furthermore, cellular perturbations are increasingly characterized at the multi-omic level.
These experiments uncover important dependencies among genes, their products and environmental conditions, and these relationships accumulate in a growing number of databases.
However, integrating such prior pathway knowledge with new heterogeneous genomic measurements in an interpretable model remains a formidable challenge.
My thesis presents three approaches that incrementally address this problem.
First, I present a feature engineering method (hVIPER) that infers kinase protein activity levels in a pathway-informed manner.
Next, I develop MPL, a Multiple Kernel Learning algorithm with multi-omic pathway-derived kernel functions that was one of the joint winners of the DREAM9 Gene Essentiality Prediction Challenge.
Finally, I improve upon the DREAM9 winner by introducing empirical kernel functions computed through Random Forest tree ensembles (AKLIMATE).
AKLIMATE outperforms state-of-the-art methods in diverse phenotype learning tasks, including predicting microsatellite instability in endometrial and colorectal cancer, survival in breast cancer and shRNA knockdown response in CCLE cell lines.
To conclude, I briefly demonstrate how AKLIMATE can be adapted to develop multi-omic minimum-feature predictors of patient subtypes.