UC Santa Cruz
Learning Structured and Causal Probabilistic Models for Computational Science
- Author(s): Sridhar, Dhanya
- Advisor(s): Getoor, Lise
- et al.
The drive to understand human phenomena such as our behavior and biology guides scientific discovery in the social and biological sciences. Today's wealth of observational and experimental data presents both opportunities and challenges for machine learning methods to facilitate these discoveries around human behavior and biology. Social media sites provide observational data, capturing snapshots of how users feel towards current events, engage in discourse with one another, and reflect on behavioral factors that affect their mood. These rich textual data support socio-behavioral modeling and understanding. In biology, large-scale experimental datasets are available, coupled with extensive efforts to extract and curate scientific ontologies and knowledge bases. Such empirical data enable inferences in pharmaceutical sciences and genetics. While standard machine learning methods build probabilistic models using social media posts or gene expression levels, they fall short in handling three important challenges in these problems. First, in socio-behavioral and biological domains, inferences are interrelated and require collective reasoning. Second, prior knowledge from multiple sources, such as textual or experimental evidence, is abundant, and probabilistic methods must fuse these signals of varying fidelity. Third, to advance discoveries in social and biological sciences, computational methods must go beyond predictive performance. In both domains, experts seek new insights and knowledge, requiring techniques to discover patterns and causal relationships directly from data.
My dissertation addresses the challenges of computational science domains by developing a unified probabilistic framework that: 1) exploits useful structure in the domain to make collective inferences; 2) fuses several sources of signals; 3) discovers causal structure; and 4) enables learning of complex, structured models directly from data. I validate this framework on important scientific modeling problems such as online debate and dialogue, mood and behavioral choices, interactions between drug treatments, and gene regulation. In this thesis, I first develop structural patterns for collective inference by evaluating several modeling choices for online debates. My findings illustrate the harms of naive collective reasoning while showing the benefits of jointly modeling debate interactions and users' stances. I extend these collective patterns to fuse several sources of biological information, which leads to state-of-the-art performance in drug-drug interaction prediction. To go beyond predictive performance, I combine multiple statistical signals to infer causal networks of gene regulation from measurements of gene expression and to estimate causal effects in dialogue. Finally, I develop algorithms that learn these modeling patterns directly from data, showing the benefits of discovering complex dependencies in the drug-drug interaction prediction domain. The technical contributions highlighted in my thesis lay the foundation for applying structured and causal models to computational science. I conclude by outlining promising areas of future research that stem from my work and further bolster probabilistic methods for scientific domains.