This dissertation considers information losses arising from the finite datasets used to train neural classifiers. It proves a relationship between these losses and the product of the expected total variation of the estimated neural model and the information about the feature space contained in the model's hidden representation. It then bounds this expected total variation as a function of the size of a randomly sampled dataset in a fairly general setting, without introducing any additional dependence on model complexity. The result is a set of bounds on information losses that are less sensitive to input compression and substantially tighter than existing bounds. These bounds are then used to explain recent experimental findings of information compression in neural networks that previous work cannot account for.

The dissertation goes on to derive analytical relationships between neural architectures and the mutual information contained in their representations, which can inform guided architecture selection schemes. It then uses these developments to propose and illustrate a new framework for analyzing training data selection methods. The dissertation uses this framework to prove that facility location methods reduce these losses, and then derives a new data-dependent bound on them. This bound can be used to evaluate datasets and serves as an additional analytical tool for studying data selection techniques.

The dissertation then applies this theory to the problem of Phase Identification in power distribution systems. In particular, it focuses on improving supervised learning accuracy by exploiting the problem's information-theoretic properties. This focus, together with the advances developed earlier in the work, leads to two new Phase Identification techniques. The first transforms the bound on information losses into a data selection technique, which is important because phase identification labels are difficult to obtain in practice. The second interprets the properties of distribution systems in terms of the information losses developed earlier in the dissertation, improving the representation learned by any classifier applied to the problem. Furthermore, because many problems in cyber-physical systems share the physical properties of phase identification exploited in this dissertation, the techniques can be applied to a wide range of similar problems.
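As a purely schematic illustration of the central relationship (the symbols here are introduced for this summary and need not match the dissertation's own notation), let $X$ denote the features, $T$ the hidden representation of the estimated model $\hat{f}$, and $\Delta I$ the information loss incurred from training on a finite dataset. The first result can then be read as a bound of the form
\[
\Delta I \;\lesssim\; \mathbb{E}\!\left[\,\mathrm{TV}(\hat{f})\,\right]\cdot I(X;T),
\]
where $\mathrm{TV}(\hat{f})$ is the total variation of the estimated model and $I(X;T)$ is the mutual information between the features and the hidden representation; the subsequent results bound $\mathbb{E}[\mathrm{TV}(\hat{f})]$ as a function of dataset size, independently of model complexity.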