eScholarship
Open Access Publications from the University of California

UCSF Library

Archives & Special Collections Projects

Silence in OCR: What Could Handwritten Documents Tell Us?

The data associated with this publication are within the manuscript.
Creative Commons 'BY-NC' version 4.0 license
Abstract

This report, produced as part of the UCSF Archives and Special Collections Summer Fellowship program, explores the efficacy of Optical Character Recognition (OCR) technology in processing archival documents. OCR, which automates the extraction of text from images, has advanced significantly in recent years and benefits archival organizations by making vast amounts of previously “hidden” data more accessible. This study examines the disparity in OCR quality between handwritten and typewritten documents and highlights that OCR is considerably less effective on handwritten text. This disparity introduces bias and underrepresentation into datasets, reducing the accessibility and utility of handwritten documents from historical archives.

Using a dataset of documents related to AIDS/HIV activism from the 1980s and 1990s, this project evaluates the performance of three OCR tools (Tesseract, Google Cloud Document AI, and Amazon Textract) across different document types. The objective is to identify the most effective OCR solution for making handwritten documents in the UCSF Archives and Special Collections more accessible. The findings aim to contribute to the broader archival field by addressing the challenges of digitizing and using handwritten archival materials, thereby supporting more inclusive and comprehensive historical research.
