Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Preparing PDF Scientific Articles for Biomedical Text Mining


PDF files are not suitable for text mining and must first be converted to a plain text format. For our purpose, we needed the text of PDF scientific articles along with section-level identification of elements such as the title, abstract and references. To this end, PDFX is a useful tool that converts PDFs of scientific articles into XML, but variability in text quality arising from how articles are published and formatted results in incorrect XML that impedes accurate text mining. Additionally, we need to mine PDFs of different types of publications, including manuscripts, research letters, reports and articles, which may have radically different formats. Hence we built an ensemble tool that post-processes PDFX XML by combining multiple input sources (OCR text, PDFBox output and the Entrez e-utilities API provided by PubMed) to improve the quality of the XML produced for journal articles. We were able to significantly improve the quality of the XML with respect to the fidelity of non-alphanumeric characters, the segmentation of the title, abstract, references and acknowledgment, and the word order of the text, leading to a data set more suitable for text mining.
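The abstract does not describe how the ensemble tool is implemented, so the following is only a minimal sketch of one plausible post-processing step: overwriting the title and abstract in a PDFX-style XML with the cleaner values found in a PubMed record. The PDFX tag names (`article-title`, `abstract`), both sample documents, and the helper names `pubmed_fields` and `patch_pdfx` are assumptions made for illustration; only the PubMed element names (`ArticleTitle`, `AbstractText`) follow the record layout returned by Entrez.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for PDFX output; the tag names are an assumption.
PDFX_SAMPLE = """<pdfx><article><front>
<title-group><article-title>Prepar ing PDF Sci entific Articles</article-title></title-group>
<abstract>PDF fi les are not suit able for text mining</abstract>
</front><body/></article></pdfx>"""

# Made-up sample shaped like a PubMed record fetched as XML from Entrez.
PUBMED_SAMPLE = """<PubmedArticleSet><PubmedArticle><MedlineCitation><Article>
<ArticleTitle>Preparing PDF Scientific Articles</ArticleTitle>
<Abstract><AbstractText>PDF files are not suitable for text mining.</AbstractText></Abstract>
</Article></MedlineCitation></PubmedArticle></PubmedArticleSet>"""


def pubmed_fields(pubmed_xml):
    """Extract the title and abstract text from a PubMed XML record."""
    root = ET.fromstring(pubmed_xml)
    title = root.find(".//ArticleTitle")
    parts = root.findall(".//Abstract/AbstractText")
    return {
        "title": "".join(title.itertext()).strip() if title is not None else None,
        # Structured abstracts have several AbstractText parts; join them.
        "abstract": " ".join("".join(p.itertext()).strip() for p in parts) or None,
    }


def patch_pdfx(pdfx_xml, fields):
    """Replace the title and abstract of a PDFX-style XML when PubMed has them."""
    root = ET.fromstring(pdfx_xml)
    for tag, key in (("article-title", "title"), ("abstract", "abstract")):
        node = root.find(f".//{tag}")
        if node is not None and fields.get(key):
            # Drop any nested markup, then set the trusted text.
            for child in list(node):
                node.remove(child)
            node.text = fields[key]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(patch_pdfx(PDFX_SAMPLE, pubmed_fields(PUBMED_SAMPLE)))
```

In this sketch the PubMed metadata is treated as the trusted source whenever it is present, and the PDFX text is left untouched otherwise. A real pipeline would also need to reconcile the OCR and PDFBox outputs, which is not shown here.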
