As machine learning becomes an increasingly empirical field, the need for rigorous empirical evaluation of existing methodology grows. In this dissertation I present a line of work that studies the effect of subtle distribution shifts on classifier accuracy. I then present the construction and evaluation of high-performance classifiers using tools from the classical kernel literature.
To study the effect of distribution shifts in machine learning, we build new test sets for the CIFAR-10 and ImageNet datasets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3%–15% on CIFAR-10 and 11%–14% on ImageNet. However, accuracy gains on the original test sets translate into larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models’ inability to generalize to slightly “harder” images than those found in the original test sets.
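The relationship between original and new test accuracy can be summarized with a simple linear fit; the sketch below is illustrative only, using placeholder accuracy values rather than measurements from this work, and assumes NumPy.

# Hypothetical sketch of the linear-fit analysis relating accuracy on the
# original test set to accuracy on the new test set. The values below are
# placeholders for illustration, not measurements from this work.
import numpy as np

orig_acc = np.array([0.85, 0.90, 0.93, 0.95])  # placeholder original-test accuracies
new_acc = np.array([0.72, 0.79, 0.84, 0.88])   # placeholder new-test accuracies

slope, intercept = np.polyfit(orig_acc, new_acc, deg=1)
# A slope greater than 1 corresponds to gains on the original test set
# translating into larger gains on the new test set.
print(f"fit: new_acc ~= {slope:.2f} * orig_acc + {intercept:.2f}")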
We then perform an in-depth evaluation of human accuracy on the ImageNet dataset. First, three expert labelers re-annotated 30,000 images from the original ImageNet validation set and the ImageNetV2 replication experiment with multi-label annotations to enable a semantically coherent accuracy measurement. Second, we evaluated five trained human labelers on both datasets. The median human labeler outperforms the best publicly released ImageNet model by 1.5% on the original validation set and by 6.2% on ImageNetV2. Moreover, the human labelers see a substantially smaller drop in accuracy between the two datasets than the best available model (less than 1% versus 5.4%). Our results put claims of superhuman performance on ImageNet in context and show that robustly classifying ImageNet at human-level accuracy is still an open problem.
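One natural convention for multi-label accuracy, assumed in the sketch below, counts a model's prediction as correct if it matches any label the expert annotators deemed valid for that image; the inputs here are hypothetical plain Python containers.

# Minimal sketch of multi-label accuracy: a prediction counts as correct
# if it is any one of the labels annotators marked as valid for the image.
# `predictions` and `valid_label_sets` are hypothetical inputs.
def multilabel_accuracy(predictions, valid_label_sets):
    correct = sum(pred in labels for pred, labels in zip(predictions, valid_label_sets))
    return correct / len(predictions)

# Example: 2 of 3 predictions fall within their images' valid label sets.
acc = multilabel_accuracy(
    ["tabby_cat", "laptop", "beagle"],
    [{"tabby_cat", "tiger_cat"}, {"notebook", "desktop_computer"}, {"beagle"}],
)
print(acc)  # 0.666...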
To examine another form of distribution shift, we study the robustness of image classifiers to temporal perturbations derived from videos. As part of this study, we construct two new datasets, ImageNet-Vid-Robust and YTBB-Robust, containing a total of 57,897 images grouped into 3,139 sets of perceptually similar images. Our datasets were derived from ImageNet-Vid and Youtube-BB, respectively, and thoroughly re-annotated
by human experts for image similarity. We evaluate a diverse array of classifiers pre-trained on ImageNet and show a median classification accuracy drop of 16 and 10 percent on our two datasets, respectively. Additionally, we evaluate three detection models and show that natural perturbations induce both classification and localization errors, leading to a median drop in detection mAP of 14 points. Our analysis demonstrates that perturbations occurring naturally in videos pose a substantial and realistic challenge to deploying convolutional neural networks in environments that require both reliable and low-latency predictions.
Finally, we investigate the connections between neural networks and simple building blocks in kernel space. In particular, using well-established feature space tools such as direct sum, averaging, and moment lifting, we present an algebra for creating “compositional” kernels from bags of features. We show that these operations correspond to many of the building blocks of “neural tangent kernels” (NTKs). Experimentally, we show a correlation in test error between neural network architectures and their associated kernels. We construct a simple neural network architecture, using only 3 × 3 convolutions, 2 × 2 average pooling, and ReLU activations, trained with SGD and the MSE loss, that achieves 96% accuracy on CIFAR-10; its corresponding compositional kernel achieves 90% accuracy. We also use our constructions to investigate the relative performance of neural networks, NTKs, and compositional kernels in the small-dataset regime. In particular, we find that compositional kernels outperform NTKs, and that neural networks outperform both kernel methods.
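A minimal PyTorch sketch of the kind of network described above appears below: it is built only from 3 × 3 convolutions, 2 × 2 average pooling, and ReLU, and trained with SGD on the MSE loss against one-hot labels. The layer widths, depth, and final linear readout are illustrative assumptions rather than the exact configuration used in this work.

# Sketch of an all-conv network using only 3x3 convolutions, 2x2 average
# pooling, and ReLU, trained with SGD on the MSE loss. Widths, depth, and
# the final linear readout are illustrative assumptions.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 convolution + ReLU + 2x2 average pooling
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AvgPool2d(kernel_size=2),
    )

class SimpleConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64),     # 32x32 -> 16x16
            conv_block(64, 128),   # 16x16 -> 8x8
            conv_block(128, 256),  # 8x8   -> 4x4
            conv_block(256, 256),  # 4x4   -> 2x2
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.fc = nn.Linear(256, num_classes)  # assumed linear readout

    def forward(self, x):
        x = self.features(x)
        x = self.pool(x).flatten(1)
        return self.fc(x)

model = SimpleConvNet()
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.MSELoss()

def train_step(images, labels):
    # MSE against one-hot targets rather than cross-entropy.
    targets = torch.nn.functional.one_hot(labels, 10).float()
    opt.zero_grad()
    loss = loss_fn(model(images), targets)
    loss.backward()
    opt.step()
    return loss.item()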