This study presents three projects that build on machine learning techniques to propose new tools for scientific discovery: 1) constructing and measuring latent variables at scale by combining faceted Rasch measurement with supervised deep learning, 2) translating the problem of exposure mixtures (i.e., projecting multiple treatment variables onto a continuous summary measure) into a data-adaptive parameter within the targeted learning causal inference framework and proposing an estimation algorithm, and 3) combining multiple techniques to improve the predictive accuracy and interpretability of clinical prediction models developed from electronic health record data.
The first project (Chapter 2) develops a general methodology to combine faceted Rasch measurement, a theoretically optimal form of item response theory, with supervised deep learning to construct and measure arbitrary interval variables. Rasch measurement theory is a method for constructing interval-scaled latent variables that are not directly observed but can be approximated by collecting data on a set of components (items) believed to indicate where an observation falls on the latent spectrum. The faceted version of Rasch modeling extends the method to rater-mediated assessments, which is essentially the setting of most supervised deep learning projects that rely on human labelers to generate a training dataset. Bringing the formal tools of item response theory to machine learning offers a host of benefits, including reducing survey interpretation bias in the labeled data and upgrading the target variable from a dichotomous or ordinal structure to a continuous, interval-scaled variable, which increases precision. Perhaps most importantly, Rasch measurement theory provides a structure for gradual iteration based on the interplay between theorization and empirical testing; lacking such a structure, the field of machine learning has been forced to develop ad hoc solutions.
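To make the measurement model concrete, a common rating-scale formulation of the many-facet Rasch model (the notation here is generic rather than the exact specification used in Chapter 2) expresses the log-odds that observation $n$ receives category $k$ rather than $k-1$ from rater $j$ on item $i$ as
\[
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) \;=\; \theta_n \;-\; \delta_i \;-\; \alpha_j \;-\; \tau_k,
\]
where $\theta_n$ is the latent measure of observation $n$, $\delta_i$ the difficulty of item $i$, $\alpha_j$ the severity of rater $j$, and $\tau_k$ the threshold separating category $k$ from $k-1$.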
In the course of the project I realized that item response theory lends itself to a natural integration with a deep learning-based estimator for applying the constructed variable to new data. While the deep learning model could be designed to predict the interval variable directly, as would be standard practice, an alternative architecture became apparent: train the deep learning estimator to predict the individual scale components (items) from the labeling instrument, then apply the item response theory transform to those components in an offline fashion. Such an architecture leads to a new form of model explanation: unlike standard neural models, in which the final dense layers are randomly initialized and learn their own internal latent variables to predict the final score, in our system those final latent variables were proposed directly through theorization and were backed by labeled data that could supply supervised feedback during optimization. Predicting each item as a separate outcome in a single neural network entails a "multitask" architecture: the system simultaneously optimizes its predictive accuracy for each task, i.e., the predicted rating on each item. Multitask architectures are believed to offer efficiency gains because correlation between tasks allows information about one task (item rating) to also inform the model's prediction on other related tasks, and in an item response theory model the items are generally highly correlated because they jointly measure an underlying latent variable. Predicting the rating on each item offers another twist: those item ratings are ordinal variables, so incorporating that scientific knowledge into our estimator has the potential to gain efficiency and to generate predictions that are consistent with the ordinal structure (i.e., predicted probabilities that are unimodal across the possible ratings).
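As an illustration of this multitask, ordinal-aware design, the sketch below uses a shared encoder with one cumulative-link (CORAL-style) head per item; the encoder, layer sizes, and names are illustrative assumptions rather than the architecture developed in Chapter 2.

```python
# Sketch: multitask network predicting each Rasch item as an ordinal outcome.
# A shared encoder feeds one cumulative-link head per item (task).
import torch
import torch.nn as nn

class MultitaskOrdinalNet(nn.Module):
    """Shared encoder with one ordinal (cumulative-link) head per scale item."""

    def __init__(self, input_dim, n_items, n_levels, hidden_dim=128):
        super().__init__()
        # Shared representation, e.g. on top of pooled text embeddings.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One scalar score per item (task).
        self.item_scores = nn.Linear(hidden_dim, n_items)
        # Unconstrained parameters mapped to increasing cutpoints per item.
        self.threshold_raw = nn.Parameter(torch.zeros(n_items, n_levels - 1))

    def forward(self, x):
        h = self.encoder(x)                              # (batch, hidden_dim)
        score = self.item_scores(h)                      # (batch, n_items)
        # Increasing cutpoints make P(rating > k) = sigmoid(score - cutpoint_k)
        # monotone in k, so predictions respect the ordinal structure.
        cutpoints = torch.cumsum(nn.functional.softplus(self.threshold_raw), dim=-1)
        return score.unsqueeze(-1) - cutpoints           # (batch, n_items, n_levels - 1)


def ordinal_loss(logits, ratings):
    """Cumulative-link loss summed over items.

    ratings: integer tensor of shape (batch, n_items) with levels 0..n_levels-1.
    """
    n_cuts = logits.shape[-1]
    cuts = torch.arange(n_cuts, device=ratings.device)
    # Binary targets: is the observed rating greater than cutpoint k?
    targets = (ratings.unsqueeze(-1) > cuts).float()
    return nn.functional.binary_cross_entropy_with_logits(logits, targets)
```

At inference time, the predicted item ratings would then be passed through the fitted item response model offline to recover the continuous latent score.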
The second project (Chapter 3) examines the problem of estimating exposure mixtures, which are collections of treatment variables for which we seek to examine joint effects on a given outcome variable (e.g. disease state). Taking inspiration from the parametric method of weighted quantile sum regression, we recast the problem of exposure mixture estimation as a data-adaptive statistical parameter within the targeted learning causal inference framework. This recasting allowed us to establish an estimation procedure that nonparametrically projects the vector of treatment variables onto a continuous latent variable that maximizes its joint relationship with the outcome. One could then evaluate causal parameters, such as treatment-specific means, on a held-out validation set using the cross-validated targeted maximum likelihood estimation procedure (CV-TMLE). Our method builds on earlier work with Alan Hubbard and Mark van der Laan as part of the varimpact algorithm, a data-adaptive method for causal variable importance that examines a single variable at a time. Through our mixture work we realized that an exposure mixture is a form of variable-set importance, allowing subgroups of treatment variables to be ranked based on their combined mixture's estimated impact on the outcome variable.
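A minimal cross-fitted sketch of this idea follows, assuming a simple WQS-style weight learner and a plug-in estimate of the quantile-specific outcome means on the validation folds; the actual procedure uses flexible machine learning for the projection and CV-TMLE for the validation-fold estimation, so the function names and modeling choices here are purely illustrative.

```python
# Sketch: learn a mixture projection on training folds, evaluate it on
# held-out folds (stand-in for the data-adaptive parameter + CV-TMLE workflow).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def data_adaptive_mixture_means(A, Y, n_quantiles=4, n_folds=5, seed=1):
    """A: (n, p) exposure matrix; Y: (n,) outcome. Returns the average of the
    per-fold quantile-specific outcome means for the learned mixture score."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_means = []
    for train_idx, valid_idx in kf.split(A):
        # 1) Data-adaptive step on the training fold: learn mixture weights
        #    (here, rescaled regression coefficients as a WQS-like stand-in).
        fit = LinearRegression().fit(A[train_idx], Y[train_idx])
        w = np.abs(fit.coef_)
        w = w / w.sum() if w.sum() > 0 else np.full(A.shape[1], 1 / A.shape[1])
        # 2) Apply the learned projection to the validation fold only.
        score = A[valid_idx] @ w
        cuts = np.quantile(score, np.linspace(0, 1, n_quantiles + 1))
        bins = np.clip(np.digitize(score, cuts[1:-1]), 0, n_quantiles - 1)
        # 3) Plug-in estimate of the mean outcome within each mixture quantile
        #    (the actual estimator targets these parameters via CV-TMLE).
        fold_means.append([Y[valid_idx][bins == q].mean() for q in range(n_quantiles)])
    return np.nanmean(np.array(fold_means), axis=0)
```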
The final project (Chapter 4) seeks to provide a guide to the development of high-quality clinical prediction models based on electronic health record data. Through the process of creating a risk prediction model for future heart attacks, we propose improved methods for many steps of the workflow, including generalized low-rank models for missing data imputation, penalized histogramming to manage the cardinality of imputed covariates, nested SuperLearner ensembling for interpretable hyperparameter optimization, accumulated local effect plots for model explanation, and the index of prediction accuracy as a general performance metric combining discrimination and calibration.
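As one concrete example from that list, the index of prediction accuracy (IPA) for a binary outcome is one minus the ratio of the model's Brier score to the Brier score of a null model that predicts the observed prevalence for everyone; the short sketch below is a generic implementation of that definition rather than the chapter's code.

```python
# Sketch: index of prediction accuracy (IPA) for a binary outcome.
# Positive values indicate improvement over the prevalence-only null model;
# the metric penalizes both poor discrimination and poor calibration.
import numpy as np

def index_of_prediction_accuracy(y_true, y_prob):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    brier_model = np.mean((y_true - y_prob) ** 2)
    brier_null = np.mean((y_true - y_true.mean()) ** 2)
    return 1.0 - brier_model / brier_null
```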