Incorporating human visual properties into neural network models
Open Access Publications from the University of California

UC Santa Barbara

UC Santa Barbara Electronic Theses and Dissertations


Humans and many other animals process the visual field with spatially varying resolution (foveated vision) and use peripheral processing to guide eye movements, pointing the fovea at objects of interest to acquire high-resolution information. A foveated architecture enables computationally efficient, rapid scene exploration and can yield energy savings for computer vision. A foveated model can also serve as a proxy for identifying circumstances in which humans might make errors and as a tool for understanding human vision. Foveated architectures have been implemented in earlier computer vision models. However, they have not been explored with transformer networks, the latest computer vision architecture, which offer better robustness against adversarial attacks and better representation of spatial relationships across the entire image.
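
As a rough illustration of what spatially varying resolution means computationally, the following minimal sketch blurs an image more strongly the farther a pixel lies from the fixation point. It is not the dissertation's FoveaTer or FST module: the function name `foveate`, the parameters `fovea_radius` and `max_half_width`, and the choice of square average pooling are all assumptions made for this example.

```python
import numpy as np

def foveate(image, fixation, fovea_radius=4, max_half_width=3):
    """Average-pool a 2D image with a window that grows with eccentricity.

    Hypothetical helper for illustration only. Pixels closer than
    `fovea_radius` to the fixation keep full resolution; farther pixels are
    replaced by the mean of a progressively larger neighbourhood.
    """
    h, w = image.shape
    fy, fx = fixation
    out = np.empty_like(image, dtype=float)
    for y in range(h):
        for x in range(w):
            # Eccentricity: Chebyshev distance from the fixation point.
            ecc = max(abs(y - fy), abs(x - fx))
            # Pooling half-width grows in steps with eccentricity, up to a cap.
            k = min(max_half_width, ecc // fovea_radius)
            patch = image[max(0, y - k):y + k + 1, max(0, x - k):x + k + 1]
            out[y, x] = patch.mean()
    return out

rng = np.random.default_rng(0)
img = rng.random((16, 16))
foveated = foveate(img, fixation=(8, 8))
```

In this sketch the region around the fixation is copied unchanged while the periphery is smoothed, which is one simple way to model the loss of detail away from the fovea.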

We propose foveated computational modules for object classification (FoveaTer) and object detection (FST), integrated into the vision transformer architecture. We evaluate FoveaTer’s computational savings and gains in robustness to adversarial attacks relative to a full-resolution model. We use the self-attention weights to guide the model’s eye movements. We also investigate using FoveaTer to predict various human behavioral effects: we ran a psychophysics experiment on a scene categorization task and predicted human categorization performance with the FoveaTer model. Using two additional psychophysics experiments, a forced-fixation recognition experiment in which observers detect a mouse in the visual periphery and a visual search experiment in which they detect a mouse using a limited number of fixations, we also evaluate how the FST model uses contextual information to guide eye movements as humans do.
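
The abstract does not spell out how self-attention weights are turned into eye movements. One simple possibility, sketched here purely as an assumption, is a greedy rule that fixates the not-yet-visited image patch with the highest attention weight. The function `next_fixation`, its arguments, and the exclusion of visited patches are hypothetical and are not taken from the dissertation.

```python
import numpy as np

def next_fixation(attention, visited, grid_shape):
    """Pick the unvisited patch with the largest attention weight.

    Hypothetical greedy rule for illustration only.
    attention  : 1D sequence with one weight per image patch (row-major order)
    visited    : set of patch indices that were already fixated
    grid_shape : (rows, cols) of the patch grid
    """
    scores = np.asarray(attention, dtype=float).copy()
    for i in visited:
        # Exclude patches that were already fixated.
        scores[i] = -np.inf
    idx = int(np.argmax(scores))
    row, col = divmod(idx, grid_shape[1])
    return idx, (row, col)

weights = [0.1, 0.5, 0.3, 0.1]  # made-up attention weights for a 2x2 grid
first_idx, first_pos = next_fixation(weights, set(), (2, 2))
second_idx, second_pos = next_fixation(weights, {first_idx}, (2, 2))
```

With these made-up weights the rule first selects the patch in row 0, column 1 and then the patch in row 1, column 0.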

In addition, we trained anthropomorphic CNN models to detect simulated tumors in simulated 3D Digital Breast Tomosynthesis phantoms and compared their performance and errors against those of radiologists. We provide preliminary results on extending the FST model to tumor search in virtual mammograms.

Thus, the contributions of the dissertation are to advance computational cost savings in computer vision, to predict human perceptual errors, and to provide a computational tool for studying human vision and cognitive science in the wild.
