Computer perception is one of the fundamental problems in artificial intelligence. Given an image or an audio recording, a human can quickly recognize and detect objects from sight, sound, or both. In computer vision, object detection is concerned with recognizing objects in images and drawing a bounding box around each of them. Researchers have worked for decades on algorithms that recognize, detect, and segment objects and scenes in images. These problems remain difficult in real-world scenarios because objects appear under varying conditions, such as different viewpoints, scales, and background noise, and they may even deform into different shapes, parts, or poses. Real-time object detection has many important applications, such as autonomous driving and video surveillance. In this dissertation, we approach visual understanding in the following ways. First, we utilize implicit information in trained neural networks to localize all objects of interest in an image using a sensitivity analysis approach. Second, we introduce a novel framework for object detection called “Ventral-Dorsal” Neural Networks, inspired by the structure of the human brain. Third, we expand the Ventral-Dorsal framework, focusing on attaining the real-time performance needed for online applications. Fourth, we compare human attention with deep neural network attention algorithms in order to understand whether neural network attention matches human attention.

Auditory perception is also crucial in artificial intelligence systems. Until recently, auditory object recognition pipelines required substantial hand engineering for feature extraction. Engineered features need to be tuned for every individual problem, and some popular feature extraction methods are time-consuming, which limits real-time applications.
Here we attempt to avoid these problems using end-to-end training. Thanks to recent improvements in deep neural networks, we are able to eliminate hand engineering of features by optimizing feature extraction and classification jointly in one network. In this dissertation, we approach the auditory object recognition problem in the following ways: we propose a novel “end-to-end” deep neural network architecture that takes raw audio as input and maps it to class labels, and we apply the proposed architecture to a new dataset of infant vocalization sounds for further investigation.