The past two decades have witnessed tremendous advances in medical imaging techniques. The explosive growth of high-dimensional imaging data brings new challenges to statisticians. Machine learning has opened new horizons in a variety of tasks, including image recognition and restoration, personalized medicine, medical image analysis, and many others. However, machine learning systems remain mostly black boxes despite their widespread adoption. Understanding the statistical properties of black-box models and the reasoning behind their predictions is crucial, as it helps to interpret the analysis results. This dissertation is dedicated to the development of new statistical learning methods for image data analysis and to new insights into the behavior of black-box predictive models.
We start by proposing a novel linear discriminant analysis approach for the classification of high-dimensional matrix-valued data, which commonly arise in imaging studies. Motivated by the equivalence between conventional linear discriminant analysis and ordinary least squares, we consider an efficient nuclear norm penalized regression that encourages a low-rank structure. Theoretical properties, including a non-asymptotic risk bound and a rank consistency result, are established. Simulation studies and an application to electroencephalography data show the superior performance of the proposed method over existing approaches.
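To illustrate how a nuclear norm penalty encourages low rank, the following is a minimal sketch of a generic proximal gradient solver for a nuclear norm penalized least squares problem with matrix-valued predictors. It is not the estimator developed in this dissertation: the objective, the function names `svt` and `nuclear_norm_regression`, and the step-size choice are assumptions made for illustration only.

```python
import numpy as np


def svt(M, tau):
    """Singular value soft-thresholding: proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # small singular values become exactly zero
    return (U * s_shrunk) @ Vt


def nuclear_norm_regression(X, y, lam, n_iter=500):
    """Proximal gradient descent for the (assumed, illustrative) objective

        min_B  (1 / 2n) * sum_i (y_i - <X_i, B>)^2 + lam * ||B||_*

    X has shape (n, p, q) and y has shape (n,).
    """
    n, p, q = X.shape
    X_flat = X.reshape(n, p * q)
    # Lipschitz constant of the gradient of the smooth least squares term.
    L = np.linalg.norm(X_flat, 2) ** 2 / n
    B = np.zeros((p, q))
    for _ in range(n_iter):
        resid = X_flat @ B.ravel() - y
        grad = (X_flat.T @ resid).reshape(p, q) / n
        # Gradient step on the smooth part, then shrink the singular values.
        B = svt(B - grad / L, lam / L)
    return B
```

Because the thresholding step sets small singular values exactly to zero, larger values of `lam` tend to yield lower-rank coefficient matrices, which is the mechanism behind the low-rank structure mentioned above.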
Next, we propose a novel nonparametric matrix response regression model to characterize the association between 2D image outcomes and predictors such as time and patient information. Our estimation procedure can be formulated as a nuclear norm regularization problem, which captures the underlying low-rank structure of the dynamic 2D images. We develop an efficient algorithm to solve the optimization problem and introduce a Bayesian information criterion for selecting the tuning parameters of our model. Asymptotic theory, including a risk bound and rank consistency, is derived. We then evaluate the empirical performance of our method using numerical simulations and real data applications from a calcium imaging study and an electroencephalography study.
Finally, we propose to trace the predictions of a black-box model back to the training data through a representation theorem calibrated on a continuous, low-dimensional latent space, making the model more transparent. We show that, for a given test point and a given class, the pre-activation prediction value can be decomposed into a sum of representer values, where each representer value corresponds to the importance of a training point to the model prediction. These representer values give users a deeper understanding of how the training points lead the machine learning system to its prediction. We further examine our method through theoretical studies, numerical experiments, and applications such as model debugging.
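The idea of decomposing a prediction into per-training-point contributions can be sketched in a much simpler setting than the one studied here. The example below assumes a ridge regression on fixed feature vectors, not the latent-space construction of this dissertation; the function name `ridge_representer_values` and all variable names are hypothetical. At the ridge solution, the stationarity condition lets the weight vector be written as a weighted sum of training features, so the test prediction splits exactly into one value per training point.

```python
import numpy as np


def ridge_representer_values(F_train, y_train, f_test, lam):
    """Decompose a ridge prediction into per-training-point contributions.

    Illustrative objective (an assumption, not the dissertation's model):
        min_w  (1 / n) * sum_i (w . f_i - y_i)^2 + lam * ||w||^2
    """
    n, d = F_train.shape
    # Closed-form minimizer of the objective above.
    w = np.linalg.solve(F_train.T @ F_train / n + lam * np.eye(d),
                        F_train.T @ y_train / n)
    # Setting the gradient to zero gives w = sum_i alpha_i * f_i,
    # with alpha_i = -(w . f_i - y_i) / (lam * n).
    alpha = -(F_train @ w - y_train) / (lam * n)
    # Contribution of training point i to the prediction at f_test.
    values = alpha * (F_train @ f_test)
    return w, values


# Small synthetic demonstration with made-up data.
rng = np.random.default_rng(0)
F_train = rng.normal(size=(30, 5))
y_train = rng.normal(size=30)
f_test = rng.normal(size=5)
w, values = ridge_representer_values(F_train, y_train, f_test, lam=0.1)
print("prediction:", w @ f_test)
print("sum of per-point values:", values.sum())
```

In this toy setting, training points with large positive or negative values are the ones that pull the prediction most strongly in one direction, which is the kind of inspection that can support model debugging.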