Qin, Siyang

Text Spotting in the Wild

2018

Advisor(s): Manduchi, Roberto

Abstract

Detecting and segmenting text in natural images is a challenging task with applications in many scenarios, such as video surveillance, forensics, video annotation, and mobile OCR. Our main interest in text spotting stems from its potential application in assistive devices for blind people. In this thesis, I propose two efficient and effective text detection systems, a new text stroke segmentation algorithm with state-of-the-art performance, and a novel encoder-decoder network architecture that can automatically remove text from an image.

The first text detection algorithm is a region-based method designed in a bottom-up manner: characters are detected first and then grouped into words. To find each character, Maximally Stable Extremal Regions (MSERs) are used to propose a large number of candidate regions, which are then fed to a Convolutional Neural Network (CNN) that filters out background regions. To improve robustness and avoid the "tricky" post-processing (character grouping) step of the previous method, a cascaded fully convolutional network (FCN) is proposed that predicts the location of each word directly by exploiting a wide range of context information.
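To make the bottom-up pipeline concrete, the following is a minimal sketch of its last two stages: discarding candidate regions that a classifier scores as background, then grouping the surviving character boxes into words. It is not the method used in the thesis. The function names, the thresholds, the single-text-line assumption, and the greedy left-to-right grouping rule are all illustrative assumptions, and the scores stand in for the output of a CNN.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def filter_candidates(candidates: Sequence[Box], scores: Sequence[float],
                      threshold: float = 0.5) -> List[Box]:
    """Keep candidate regions whose (hypothetical) text score passes a threshold."""
    return [box for box, score in zip(candidates, scores) if score >= threshold]


def vertical_overlap(a: Box, b: Box) -> float:
    """Fraction of the shorter box's height that both boxes share."""
    shared = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(0.0, shared) / shorter if shorter > 0 else 0.0


def group_characters(boxes: Sequence[Box], max_gap_ratio: float = 0.5,
                     min_overlap: float = 0.5) -> List[Box]:
    """Greedily merge character boxes on one text line into word boxes.

    Assumption: a character joins the current word when the horizontal gap
    is small relative to the word height and the boxes overlap vertically.
    """
    words: List[Box] = []
    for box in sorted(boxes):  # tuples sort by x_min first
        if words:
            last = words[-1]
            gap = box[0] - last[2]
            height = last[3] - last[1]
            if gap <= max_gap_ratio * height and vertical_overlap(last, box) >= min_overlap:
                # Extend the current word box to cover the new character.
                words[-1] = (last[0], min(last[1], box[1]),
                             max(last[2], box[2]), max(last[3], box[3]))
                continue
        words.append(box)
    return words


if __name__ == "__main__":
    candidates = [(0, 0, 10, 20), (12, 0, 22, 20), (24, 1, 34, 21),
                  (60, 0, 70, 20), (72, 0, 82, 20), (40, 0, 45, 3)]
    scores = [0.9, 0.8, 0.95, 0.7, 0.85, 0.1]  # made-up classifier outputs
    print(group_characters(filter_candidates(candidates, scores)))
```

Hand-tuned rules of this kind, with their gap and overlap thresholds, are the sort of fragile post-processing that predicting word locations directly is meant to avoid.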

Segmenting text strokes from their background can benefit optical character recognition (OCR) and other tasks. I propose the use of an FCN and a fully connected conditional random field (CRF) with a novel pairwise kernel definition that includes stroke width information. To train the model, we create a new synthetic dataset of 100K text images. Our method outperforms state-of-the-art algorithms while being more efficient.
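As a rough illustration of how stroke width could enter a pairwise kernel, the sketch below starts from the commonly used fully connected CRF kernel (a Gaussian appearance term over position and color plus a Gaussian smoothness term over position) and multiplies the appearance term by an extra Gaussian factor over stroke width difference. The abstract does not give the actual kernel definition, so the form of the stroke width term, the function names, and all weights and bandwidths here are assumptions.

```python
import math
from typing import Sequence


def gaussian(squared_distance: float, theta: float) -> float:
    """Unnormalized Gaussian of a squared distance with bandwidth theta."""
    return math.exp(-squared_distance / (2.0 * theta ** 2))


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_kernel(pos_i: Sequence[float], pos_j: Sequence[float],
                    color_i: Sequence[float], color_j: Sequence[float],
                    stroke_i: float, stroke_j: float,
                    w_appearance: float = 1.0, w_smoothness: float = 1.0,
                    theta_pos: float = 10.0, theta_color: float = 10.0,
                    theta_stroke: float = 2.0, theta_smooth: float = 3.0) -> float:
    """Affinity between two pixels; larger values favor the same label.

    Assumption: stroke width enters as one more Gaussian factor in the
    appearance term. The parameter values are placeholders.
    """
    d_pos = squared_distance(pos_i, pos_j)
    d_color = squared_distance(color_i, color_j)
    d_stroke = (stroke_i - stroke_j) ** 2
    appearance = (gaussian(d_pos, theta_pos)
                  * gaussian(d_color, theta_color)
                  * gaussian(d_stroke, theta_stroke))
    smoothness = gaussian(d_pos, theta_smooth)
    return w_appearance * appearance + w_smoothness * smoothness
```

Under this assumed form, two nearby pixels of similar color receive a lower affinity when their estimated stroke widths differ, which discourages a stroke label from leaking into background structures of a different thickness.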

Automatic removal of text or other objects from an image is considered an unsolved problem. It is challenging because the foreground segmentation is unknown, unlike in traditional image inpainting, where algorithms assume that the region to reconstruct is known. To solve this task, I introduce a novel encoder-decoder network architecture with two parallel and interconnected decoder branches, one designed to segment the foreground, the other to recover the missing background. The two decoders are connected via neglect nodes that determine which information from the encoder should be used for synthesis, and which should be neglected. The foreground text stroke segmentation and the synthesized background image are produced in a single forward pass.
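One possible reading of the neglect nodes is a soft gate: the segmentation branch produces a per-location foreground probability, and encoder features are scaled by one minus that probability before reaching the synthesis branch. The sketch below shows only this gating arithmetic on a small 2D feature map. The abstract does not specify the mechanism, so the multiplicative form and the name `neglect_gate` are assumptions rather than the architecture described in the thesis.

```python
from typing import List

Grid = List[List[float]]


def neglect_gate(encoder_features: Grid, foreground_prob: Grid) -> Grid:
    """Suppress encoder features where the foreground probability is high.

    Assumption: the gate is (1 - foreground probability), applied
    element-wise, so locations judged to be text contribute little to
    background synthesis.
    """
    return [
        [feature * (1.0 - prob) for feature, prob in zip(feature_row, prob_row)]
        for feature_row, prob_row in zip(encoder_features, foreground_prob)
    ]


if __name__ == "__main__":
    features = [[1.0, 2.0], [3.0, 4.0]]
    text_prob = [[0.0, 1.0], [0.5, 0.0]]  # made-up segmentation output
    print(neglect_gate(features, text_prob))
```

Because the gate depends only on the segmentation output, both the segmentation and the gated synthesis can be computed in the same forward pass.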


UC Santa Cruz
