Towards biomedical interpretability: methods for knowledge graph machine learning research in medicine
- Author(s): Nelson, Charlotte
- Advisor(s): Bandyopadhyay, Sourav, et al.
If you put your clinical information into an algorithm and it told you exactly when you would get sick and what disease you would have, what would be your first question? Mine would be: why? This question motivates my dissertation, because the only way to prevent an outcome is to understand, at a biological level, why it will happen. There are substantial barriers to answering this question, including data silos and the limited explainability of current machine learning algorithms. To bypass these barriers, we have expanded a knowledge graph, SPOKE, that embraces the natural heterogeneity and complexity of biology by connecting over 30 biological and medical databases. We show that propagating data through SPOKE generates human- and machine-readable embeddings that describe the input(s) (i.e., what is measured) in terms of all nodes in SPOKE (Diseases, Genes, Compounds, Pathways, etc.). These embeddings are called Propagated SPOKE Entry Vectors (PSEVs). Our research demonstrates that PSEVs are useful for multiple types of inputs. In two studies, we show the power of this approach using Electronic Health Records (EHRs) as inputs. The first study serves as a foundation and shows that PSEVs contain known and novel relationships between the input data and nodes in SPOKE. The second study compares EHRs with SPOKEsigs (signatures, or PSEVs for individual patients at a specific timepoint) for predicting whether a patient will develop multiple sclerosis. Moreover, we illustrate how to trace a prediction back to the biological drivers of the classifier. Finally, in collaboration with NASA, we use mouse transcriptomic data captured during spaceflight to infer physiological changes experienced by astronauts.
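The idea of propagating an input through a graph to obtain a vector over all nodes can be illustrated with a minimal sketch. The abstract does not specify the propagation algorithm, so the sketch below assumes a random walk with restart; the four-node toy graph, the restart probability, and the function name `propagate` are all hypothetical and are not taken from SPOKE.

```python
import numpy as np

# Hypothetical toy graph; the node names only echo SPOKE node types.
NODES = ["Disease", "Gene", "Compound", "Pathway"]
ADJACENCY = np.array([
    [0, 1, 1, 0],  # Disease  - Gene, Compound
    [1, 0, 1, 1],  # Gene     - Disease, Compound, Pathway
    [1, 1, 0, 0],  # Compound - Disease, Gene
    [0, 1, 0, 0],  # Pathway  - Gene
], dtype=float)


def propagate(adjacency, entry_weights, restart=0.33, tol=1e-10, max_iter=1000):
    """Random walk with restart (an assumed stand-in for the real method).

    Returns one score per node, summing to 1, that describes how strongly
    each node is reached from the entry point(s).
    """
    adjacency = np.asarray(adjacency, dtype=float)
    col_sums = adjacency.sum(axis=0)
    if np.any(col_sums == 0):
        raise ValueError("every node needs at least one edge in this sketch")
    # Normalize each column so it holds the walker's step probabilities.
    transition = adjacency / col_sums

    seed = np.asarray(entry_weights, dtype=float)
    seed = seed / seed.sum()
    scores = seed.copy()
    for _ in range(max_iter):
        # Step along an edge, or jump back to the entry point(s).
        updated = (1 - restart) * (transition @ scores) + restart * seed
        if np.abs(updated - scores).sum() < tol:
            return updated
        scores = updated
    return scores


# Enter the graph at "Disease" and read off a vector over all nodes.
vector = propagate(ADJACENCY, [1, 0, 0, 0])
for name, score in zip(NODES, vector):
    print(f"{name:9s} {score:.3f}")
```

Each entry of the resulting vector is tied to a named node, which is the sense in which an embedding of this kind stays readable by both people and downstream models.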