This report, produced as part of the UCSF Archives and Special Collections Summer Fellowship program, explores the efficacy of Optical Character Recognition (OCR) technology in processing archival documents. OCR, which automates the extraction of text from images, has advanced significantly in recent years and offers substantial benefits to archival organizations by making vast amounts of previously “hidden” data more accessible. This study examines the disparity in OCR quality between handwritten and typewritten documents and highlights that OCR is considerably less effective on handwritten text. This disparity leads to bias and underrepresentation in datasets, reducing the accessibility and utility of handwritten documents from historical archives.
Using a dataset of documents related to AIDS/HIV activism from the 1980s and 1990s, this project evaluates the performance of three OCR tools (Tesseract, Google Cloud Document AI, and Amazon Textract) across different document types. The objective is to identify the most effective OCR solution for improving the accessibility of handwritten documents within the UCSF Archives and Special Collections. The findings aim to contribute to the broader archival field by addressing the challenges of digitizing and using handwritten archival materials, thereby supporting more inclusive and comprehensive historical research.
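One common way to compare OCR tools on the same page is the character error rate (CER): the edit distance between a reference transcription and the OCR output, divided by the length of the reference. The sketch below is illustrative only and assumes a hand-made reference transcription is available; the function names are hypothetical and it is not presented as the metric or code used in this report.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumes a reference transcription exists)."""
    if not reference:
        raise ValueError("reference transcription must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Under these assumptions, each tool's output for a page would be scored against the same reference, and the scores could then be averaged separately for handwritten and typewritten pages to expose the gap between the two document types.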
This report, developed as part of the 2024 UCSF Industry Documents Library Undergraduate Summer Fellowship, examines four projects that apply natural language processing and data science to the JUUL Labs Collection and the broader IDL. Project One investigates the optical character recognition (OCR) accuracy of low-quality and handwritten documents in the absence of ground truth data. Project Two explores embedding search algorithms and visualizations aimed at improving the relevance of document recommendations for users. Project Three uses txt-ferret to scan a large corpus of industry documents for sensitive information, including credit card numbers. Finally, Project Four assesses biases in large language model (LLM) summarization through the lens of sentiment analysis.
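To illustrate the kind of check involved in scanning text for credit card numbers, the following sketch pairs a regular expression for candidate digit runs with the Luhn checksum to filter out most false positives. It is a simplified illustration under stated assumptions, not a description of txt-ferret's internals or of the pipeline used in Project Three; the pattern and function names are hypothetical.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, ch in enumerate(reversed(digits)):
        d = int(ch)
        if index % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return Luhn-valid candidate card numbers found in text, separators removed."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

The checksum step matters for OCR-derived corpora, where long digit runs such as invoice or document identifiers are common and a pattern match alone would flag many of them.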