eScholarship
Open Access Publications from the University of California

UC Irvine Electronic Theses and Dissertations

Stuff’s Cheap, Things Are Expensive: Recognizer Disparities for Object vs. Homogeneous Texture Patches

Abstract

Decades prior to the advent of deep learning, filter-based recognition and synthesis techniques functioned competently on patches of homogeneous texture (Stuff) but not on object-centric patches (Things). Because this competence gap was so obvious, requiring the invention of entirely new models to handle Things and scenes at all, it has largely eluded quantification. Using a subset of images from the Caltech256 and USPTex databases, this dissertation develops methodology to quantify a probable Things vs. Stuff processing dichotomy, examining the separability of these two metacategories through the performance, similarity, and quality disparities they produce both in primitive, random-noise filterbank recognizers and in a parameter-intensive ensemble of WGAN-GP discriminators. In the primitive recognizers, Stuff is shown to be categorically easier to recognize than Things, and the filter kernel size is shown to matter much less than the choice of filterbank histogram statistic. A subordinal statistic, the signchain, is introduced and shown to be comparably effective to retaining the histogram bins themselves.

As a separate emphasis, the discriminators of the GANsemble are shown to jointly retain the ability to perform ordinary classification, even though each is trained only to spot fakes and only on a single class, provided each network's activation in the ensemble undergoes a form of subtractive normalization. To visualize the discriminability lost under the GAN loss, an identical-architecture ensemble trained with cross-entropy loss, the NONGANsemble, is created for comparison. Matrices of pairwise firing affinity on real and fake images and of pairwise model-space distance (i.e., MSE, p-norm, Jensen-Shannon, signchain, and SSIM of the weights) are inspected for both ensembles, showing the Stuff detectors to be more promiscuous and the Thing vs. Stuff distinction to emerge visibly early or late in the network layers depending on the choice of loss function. NONGANs are shown to generalize better to unseen classes (permitting an effective omnibus classifier for objectness), but at the cost of offering no control over synthesis and being potentially less compressible.

Performance is at ceiling in the deep recognizers for both Things and Stuff, but competent synthesis of Stuff systematically precedes that of Things, even when transfer learning is used to retrain Thing GANs. The popular Inception Score used for GAN quality assessment is shown to be unusably biased against Stuff because Inception-v3 was trained on Things, and the Fréchet Inception Distance (FID) is recommended in its place. Late in VGG16, Stuff classes occupy fewer filter channels but occupy them more fully.

In the final chapter, Things and Stuff classes and networks are studied in terms of the separability of their MDS embeddings, and an algorithm and taxonomy are created that facilitate the conjecture that Stuff behaves more like a single Thing than all Things behave like a single kind of Stuff. Using interleaved rounds of MDS and Procrustes superimposition, the embedding of embeddings, or metaembedding, is introduced, visually reinforcing the main results of earlier chapters (such as the resemblance of the signchain distance to the Jensen-Shannon distance). Finally, the rankings produced by the primitive and deep recognizers are combined both in a higher-level embedding and via the Borda Count method to produce a composite recognition-difficulty ranking, which supports the finding that Things are often harder to recognize than Stuff.
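As an illustration of the rank-combination step just described, the following is a minimal Borda-count sketch in Python; the function name, class labels, and orderings are hypothetical stand-ins, not taken from the dissertation.

from collections import defaultdict

def borda_combine(rankings):
    # Combine several rankings (hardest class first) into one composite ranking.
    # Each class earns (n - position - 1) points per ranking; higher totals rank harder.
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position - 1
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical difficulty rankings from a primitive and a deep recognizer (hardest first):
primitive_rank = ["dog", "bicycle", "bark", "gravel"]
deep_rank = ["bicycle", "dog", "gravel", "bark"]
print(borda_combine([primitive_rank, deep_rank]))
# -> ['dog', 'bicycle', 'bark', 'gravel'] (dog and bicycle tie; ties keep first-seen order)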
Synthesis quality as estimated by FID is moderately correlated with ease of recognition, suggesting that a "computational disfluency" account of image complexity, as composite processing difficulty under fundamental operations (retrieval, segmentation, restoration, and destruction), is possible. This implies that Things and Stuff classes should not be naively combined in artificial vision systems, and raises the suspicion that they are not fully combined in natural vision systems.
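For reference, the Fréchet Inception Distance used above compares Gaussian fits to real and generated Inception-v3 feature activations. Below is a minimal sketch of the standard computation, assuming the feature means and covariances have already been extracted (the Inception-v3 feature extraction itself is omitted):

import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(mu_real, cov_real, mu_gen, cov_gen):
    # Squared distance between the feature means plus a trace term comparing the covariances.
    covmean = sqrtm(cov_real @ cov_gen)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_real - mu_gen) ** 2)
                 + np.trace(cov_real + cov_gen - 2.0 * covmean))

Lower FID means the generated patches are statistically closer to the real ones in feature space.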
