This thesis documents three contributions to statistical learning theory, developed with an emphasis on the demands that modern, large-scale datasets place on statistical analysis. The contributions concern information theory, dimension reduction and density estimation: three foundational topics in statistical theory with numerous applications, both to practical problems and to the development of other statistical methodology.
In Chapter \ref{chapter:fdiv}, I describe the development of a unifying treatment of inequalities between $f$-divergences, a general class of divergences between probability measures that includes as special cases many divergences commonly used in probability, mathematical statistics and information theory, such as the Kullback-Leibler divergence, the chi-squared divergence, the squared Hellinger distance and the total variation distance. In contrast with previous research in this area, we study the problem of obtaining sharp inequalities between $f$-divergences in full generality. In particular, our main results allow the number $m$ of divergences involved to be an arbitrary positive integer and the divergences $D_f, D_{f_1}, \dots, D_{f_m}$ themselves to be arbitrary $f$-divergences. We show that the underlying optimization problems can be reduced to low-dimensional optimization problems, and we outline methods for solving them. We also show that many existing inequalities between $f$-divergences can be recovered as special cases of our results, and we improve on some existing non-sharp inequalities.
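For context, the standard definition of an $f$-divergence and the generators of the examples mentioned above are sketched here; the exact conventions (for instance, the normalization of the total variation distance) may differ slightly from those adopted in Chapter \ref{chapter:fdiv}. For a convex function $f : (0, \infty) \to \mathbb{R}$ with $f(1) = 0$ and probability measures $P \ll Q$,
\begin{equation*}
D_f(P \| Q) := \int f\!\left(\frac{dP}{dQ}\right) dQ,
\end{equation*}
and common choices of the generator are $f(x) = x \log x$ (Kullback-Leibler), $f(x) = (x - 1)^2$ (chi-squared), $f(x) = (\sqrt{x} - 1)^2$ (squared Hellinger) and $f(x) = \tfrac{1}{2}\lvert x - 1 \rvert$ (total variation).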
In Chapter \ref{chapter:srp}, I describe the development of a new dimension reduction technique suited for interpretable inference in supervised learning problems involving high-dimensional data. This technique, Supervised Random Projections (SRP), is introduced with the goal of ensuring that, in comparison with ordinary dimension reduction, the compressed data is more relevant to the response variable of the supervised learning problem at hand. By incorporating variable importances, we ensure that the compressed data still accurately explains the response variable, thereby lending more interpretability to the dimension reduction step. Further, variable importances ensure that, even in the presence of numerous nuisance variables, the projected data retains at least a moderate amount of information from the important variables, giving those variables a fair chance of being selected by downstream formal hypothesis tests.
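As a purely illustrative sketch of the general idea, and not the precise construction of Chapter \ref{chapter:srp}, variable importances can be incorporated into a random projection by reweighting the columns of the data matrix before projecting: given data $X \in \mathbb{R}^{n \times p}$, nonnegative importance weights $w_1, \dots, w_p$ computed from the response, and a random matrix $R \in \mathbb{R}^{p \times k}$ with independent $N(0, 1/k)$ entries, one may form the compressed data
\begin{equation*}
\tilde{X} := X \, \mathrm{diag}(w_1, \dots, w_p) \, R \in \mathbb{R}^{n \times k},
\end{equation*}
so that variables deemed important for the response contribute more heavily to the projected coordinates than nuisance variables do.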
In Chapter \ref{chapter:npmle}, I establish several adaptivity properties of the Non-Parametric Maximum Likelihood Estimator (NPMLE) in the problem of estimating an unknown Gaussian location mixture density from independent and identically distributed observations. Further, I explore the role of the NPMLE in the widely studied problem of denoising normal means, that is, the problem of estimating a vector of unknown means from noisy Gaussian observations. In this problem, I prove that the Generalized Maximum Likelihood Empirical Bayes estimator (GMLEB) approximates the Oracle Bayes estimator in expected squared $\ell_2$ norm at adaptive parametric rates, up to additional logarithmic factors.
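To fix ideas, the basic objects involved can be written in a standard formulation; the precise assumptions and scalings are those of Chapter \ref{chapter:npmle}. A Gaussian location mixture density with mixing measure $G$ is
\begin{equation*}
f_G(x) := \int \phi(x - \theta) \, dG(\theta), \qquad \phi(z) := \frac{1}{\sqrt{2\pi}} e^{-z^2/2},
\end{equation*}
and, given i.i.d. observations $X_1, \dots, X_n$ from $f_{G^*}$ for an unknown $G^*$, the NPMLE maximizes the likelihood over all mixing measures:
\begin{equation*}
\hat{G}_n \in \operatorname*{arg\,max}_{G} \; \sum_{i=1}^{n} \log f_G(X_i),
\end{equation*}
where the maximum is taken over all probability measures $G$ on $\mathbb{R}$. In the normal means problem, one observes $X_i = \theta_i + Z_i$ with $Z_i \sim N(0, 1)$ and wishes to estimate $(\theta_1, \dots, \theta_n)$; by Tweedie's formula, the Bayes rule corresponding to a prior $G$ takes the form $x \mapsto x + f_G'(x)/f_G(x)$ for unit noise variance, and the GMLEB estimator plugs $\hat{G}_n$ into this rule in place of the empirical distribution of the means used by the oracle.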