Disease progression manifests through a broad spectrum of statically and longitudinally linked clinical features and outcomes. The resulting progression patterns are heterogeneous and can vary greatly across individual patients, leading to substantial differences in survival and quality of life. Recently, the rapid growth of healthcare databases, such as electronic health records (EHRs) and disease registries, has opened new opportunities for "data-driven" approaches to clinical decision support. This dissertation addresses the question of how machine learning (ML) techniques can capitalize on these data resources and provide actionable intelligence, moving away from rule-based clinical care toward a more data-driven and personalized model of care.
To this end, we develop a set of data-driven ML frameworks that better predict and understand disease progression under two broad clinical setups: (I) the static setup, where each patient’s observations are collected at a single point in time, and (II) the longitudinal setup, where observations of each patient are collected repeatedly over a period of time. In these setups, we focus on building ML methods that are (i) accurate, providing better performance in predicting disease-related outcomes; (ii) automated, freeing clinicians from having to choose one particular model for the dataset at hand; and (iii) actionable, in the sense that the model can answer "what if" questions and discover subgroups of patients with similar progression patterns and outcomes.
We highlight the following technical contributions. In the static setup, we present a set of novel ML algorithms for survival analysis, a framework that characterizes the relationships between clinical features and the events of interest (such as death or the onset of a certain disease) and predicts which type of event will occur and when. We start by developing a deep learning (DL) method that makes no modeling assumptions about the underlying survival process and flexibly allows for competing events. Then, we propose an automated ML framework for survival analysis that combines the collective intelligence of different survival models to produce a valid survival function that is both discriminative and well-calibrated. Lastly, we develop a DL model that accurately estimates heterogeneous treatment effects in survival analysis by adjusting for covariate shifts from multiple sources, which is what makes the problem unique and challenging. In the longitudinal setup, we first develop a DL model for dynamic survival analysis that provides personalized, event-specific survival predictions based on a patient’s heterogeneous historical context. Then, we provide a novel temporal clustering method that transforms the raw information in complex longitudinal observations into clinically relevant and interpretable information, in order to recognize future outcomes as well as life-changing disease manifestations that may cause a patient to transition between clusters.
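To make the notion of competing events concrete, the following minimal sketch computes the cumulative incidence function, that is, the probability that an event of a given type has occurred by time t, using the standard nonparametric Aalen-Johansen estimator. This is a classical baseline, not the DL method proposed in this dissertation; the function name `cumulative_incidence` and the toy data are hypothetical and chosen purely for illustration.

```python
import numpy as np

def cumulative_incidence(times, events, cause):
    """Nonparametric (Aalen-Johansen) estimate of the cumulative incidence
    F_k(t) = P(T <= t, event type = k) under competing events.

    times  : observed time for each patient (event or censoring time)
    events : 0 = censored, 1..K = type of the event that occurred
    cause  : the event type k of interest
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    grid = np.unique(times[events > 0])  # sorted distinct event times
    surv = 1.0   # probability of being free of ANY event just before t
    cif = 0.0    # running cumulative incidence for the chosen cause
    out = []
    for t in grid:
        at_risk = np.sum(times >= t)
        d_any = np.sum((times == t) & (events > 0))
        d_cause = np.sum((times == t) & (events == cause))
        # An event of type k at t requires surviving all events up to t.
        cif += surv * d_cause / at_risk
        surv *= 1.0 - d_any / at_risk
        out.append(cif)
    return grid, np.array(out)

# Toy cohort (illustrative values only): two competing event types.
toy_times = [2, 3, 3, 5, 6, 8]
toy_events = [1, 2, 0, 1, 0, 2]
grid, cif1 = cumulative_incidence(toy_times, toy_events, cause=1)
_, cif2 = cumulative_incidence(toy_times, toy_events, cause=2)
```

Because the two event types compete, the cause-specific curves share the same overall event-free survival term, so their sum can never exceed one. A model that "flexibly allows for competing events" estimates these curves as functions of a patient’s features rather than for the cohort as a whole.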
To demonstrate the utility of the proposed models, we evaluate their performance on various real-world medical datasets covering breast cancer, prostate cancer, and cystic fibrosis patient cohorts. We demonstrate that the proposed models consistently outperform clinical scores and state-of-the-art ML methods in predicting disease progression, estimating heterogeneous treatment effects, and providing insights into underlying disease mechanisms.