Recent work such as BERT, GPT, ELMo, and ULMFiT has demonstrated the effectiveness of pretraining for a variety of downstream NLP tasks. We propose a similar approach for a different learning scenario: semi-supervised OCR (text recognition).
We hypothesize that the supervised character image classifier of an OCR system is more effective when it classifies pre-learned OCR representations of character images than when it learns to do OCR from scratch. To this end, components of the classifier are pretrained in an unsupervised fashion: they take as input a sequence of character images, obtained by segmenting an image of a line of text, and reconstruct that sequence at the output. The pretraining optimizes a masked language modeling objective and has no access to the OCR labels of the character images.
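To make the pretraining objective concrete, the following is a minimal sketch of how a masked reconstruction target over a sequence of character images might be set up. It is an illustration under stated assumptions, not the method as implemented: the function names `mask_character_images` and `masked_reconstruction_loss`, the zero-fill masking, the 15% masking rate, and the mean squared error loss are all hypothetical choices that the text above does not specify.

```python
import numpy as np

def mask_character_images(chars, mask_prob=0.15, rng=None):
    """Blank out a random subset of character images in a line.

    chars: float array of shape (T, H, W), one image per segmented character.
    Returns the masked copy and a boolean mask of shape (T,).
    Zero-filling and the 15% rate are assumptions, not taken from the source.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(chars.shape[0]) < mask_prob
    if not mask.any():
        # Guarantee at least one masked position so the loss is defined.
        mask[rng.integers(chars.shape[0])] = True
    masked = chars.copy()
    masked[mask] = 0.0
    return masked, mask

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error over the masked positions only (assumed loss)."""
    per_position = ((pred - target) ** 2).mean(axis=(1, 2))
    return float(per_position[mask].mean())

# Usage: a hypothetical line of 10 character images, each 8x8 pixels.
rng = np.random.default_rng(0)
line = rng.random((10, 8, 8)) + 0.1
masked_line, mask = mask_character_images(line, mask_prob=0.3, rng=rng)
# A real model would predict the full sequence from masked_line;
# here the masked input itself stands in as a trivial "prediction".
loss = masked_reconstruction_loss(masked_line, line, mask)
```

No OCR labels appear anywhere in this setup: the target is the original image sequence itself, which is what allows pretraining on unlabeled line images.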
Our results show that a classifier trained on top of the pretrained representations achieves almost 100\% higher accuracy than a classifier trained from scratch when only 5 labeled line images are available for supervision. Our pretraining procedure can leverage large amounts of unlabeled data to learn useful OCR representations, and it improves the performance of supervised OCR systems, especially when labeled data is scarce, as with historical documents.