UC San Diego
Enhanced Convolutional Neural Networks and Their Application to Photo Optical Character Recognition
- Author(s): Lee, Chen-Yu
- Advisor(s): Tu, Zhuowen
- Cosman, Pamela
- et al.
This thesis presents two principled approaches to improving the performance of convolutional neural networks (CNNs) on visual recognition and demonstrates the effectiveness of CNNs on the optical character recognition problem. First, we propose deeply-supervised nets (DSN), a method that simultaneously minimizes classification error and improves the directness and transparency of the hidden layer learning process. We focus our attention on three aspects of traditional CNN-type architectures: (1) transparency in the effect intermediate layers have on overall classification; (2) discriminativeness and robustness of learned features, especially in early layers; (3) training effectiveness in the face of "vanishing" gradients. To address these issues, we introduce "companion" objective functions at each hidden layer, in addition to the overall objective function at the output layer.
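As a rough illustration of the companion-objective idea, the sketch below adds a weighted sum of per-layer classification losses to the output-layer loss. The function names, the use of softmax cross-entropy for every term, and the single shared `companion_weight` are assumptions made for this example; the abstract does not specify the form of the companion objectives.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def deeply_supervised_loss(output_logits, companion_logits, label,
                           companion_weight=0.3):
    """Output-layer loss plus weighted companion losses (illustrative only).

    `companion_logits` holds one logit vector per supervised hidden layer,
    e.g. produced by a small classifier attached to that layer (assumed).
    """
    output_loss = softmax_cross_entropy(output_logits, label)
    companion_loss = sum(
        softmax_cross_entropy(logits, label) for logits in companion_logits
    )
    return output_loss + companion_weight * companion_loss
```

Because every hidden layer contributes its own loss term, each one receives a gradient signal directly rather than only through the layers above it, which is the intuition behind the "vanishing" gradient point.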
Second, we seek to improve deep neural networks by generalizing the pooling operations that play a central role in current architectures. We pursue two primary directions: (1) learning a pooling function by combining max and average pooling, and (2) learning a pooling function in the form of a tree-structured fusion of pooling filters that are themselves learned. In our experiments, every generalized pooling operation we explore improves performance when used in place of average or max pooling. The advantages of the proposed methods are evident in our experimental results, which show state-of-the-art performance on MNIST, CIFAR-10, CIFAR-100, and SVHN.
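The first direction, combining max and average pooling, can be sketched as a convex mix of the two results over each pooling window. In this minimal example the mixing proportion `mix` is passed in as a fixed number, whereas in the thesis it would be learned; the function name, the 2x2 non-overlapping window, and the nested-list feature map are assumptions chosen for illustration.

```python
def mixed_pool_2x2(feature_map, mix):
    """Pool non-overlapping 2x2 windows as mix * max + (1 - mix) * average.

    `feature_map` is a list of equal-length rows; `mix` lies in [0, 1].
    mix = 1 gives plain max pooling, mix = 0 gives plain average pooling.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            max_val = max(window)
            avg_val = sum(window) / 4
            out_row.append(mix * max_val + (1 - mix) * avg_val)
        pooled.append(out_row)
    return pooled
```

Since the output is differentiable with respect to `mix`, the proportion could be updated by backpropagation alongside the other network parameters.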
Finally, we present recursive recurrent neural networks with attention modeling for lexicon-free optical character recognition in natural scene images. The primary advantages of the proposed method are: (1) the use of recursive convolutional neural networks (CNNs), which allow for parametrically efficient and effective image feature extraction; (2) an implicitly learned character-level language model, embodied in a recurrent neural network, which avoids the need to use N-grams; and (3) the use of a soft-attention mechanism, which allows the model to selectively exploit image features in a coordinated way and permits end-to-end training within a standard backpropagation framework. We validate our method with state-of-the-art performance on challenging benchmark datasets: Street View Text, IIIT5k, ICDAR, and Synth90k.
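To make the soft-attention step concrete, the sketch below turns a set of relevance scores into softmax weights and returns the weighted average of the image feature vectors. It is a generic illustration under stated assumptions, not the thesis's model: the function name is hypothetical, and the scores are taken as given, whereas in a full system they would be computed from the recurrent network's state and the image features.

```python
import math


def soft_attention(features, scores):
    """Return (weights, context) for one decoding step.

    `features` is a list of equal-length feature vectors, one per image
    location; `scores` has one unnormalized relevance score per vector.
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax: non-negative, sums to 1
    dim = len(features[0])
    context = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return weights, context
```

Every operation here is differentiable, which is what allows attention of this kind to be trained end to end with standard backpropagation.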