This dissertation discusses how predictive models are being used for scientific inquiry. Statistical and computational advances have given rise to high-dimensional models that can be fit on relatively small samples yet still predict the behavior of complex systems well. Scientists try to use such models to learn about complex biological systems, but it is not always clear how prediction accuracy translates into understanding of the underlying system. In the chapters below, I present different approaches to learning from predictive models in bioinformatics and neuroscience. In each of these collaborative works, we tailor models that both fit well and are interpretable in the context of the scientific questions.
In the first chapter, we fit and compare predictive models for the GC-content bias, an important confounder in DNA sequencing. We develop a high-resolution model that treats each base pair in the genome as a separate example; this allows us to compare many representations of GC-content and to identify which representation best predicts the variation in coverage. To deal with the huge volumes of data, we develop a new conditional dependence measure that efficiently compares different models. Selecting the model that maximizes this dependence reveals a recurring association with an experimental parameter: the selected model in each sample corresponds to a window size almost identical to the average size of DNA fragments in the sample. This recurring result can be used both for correcting the bias and for learning about the causes of the bias.
In the next chapter, we propose a new estimator for interpreting prediction-accuracy results of models for neural activity in the visual cortex. Our shuffle estimator targets the explainable variance (the proportion of signal in the measured response) while accounting for autocorrelation in the noise. Re-analyzing models of functional MRI voxels within visual area V1, we observe a strong linear correlation between the signal-to-noise ratio and prediction accuracy.
In the final chapter, we analyze neurophysiology data recorded from visual area V4 and present a full cycle of scientific investigation using prediction models in neuroscience. Whereas the previous chapters developed metrics for evaluating feature sets and prediction models, this chapter goes a step further: we use optimization algorithms together with prior scientific knowledge to propose a new feature set. We then fit regularized linear models based on this representation that generalize well to a validation data set. Finally, novel visualization and model-summary techniques help interpret the resulting prediction models, revealing rich patterns of activity across the different neurons and unexpected categories of neurons.