UC San Diego
Preparing PDF Scientific Articles for Biomedical Text Mining
- Author(s): Bhargava, Shitij
- et al.
PDF files are not suitable for text mining and must first be converted to plain text. For our purposes, we needed the text of PDF scientific articles together with section-level identification of elements such as the title, abstract and references. PDFX is a useful tool for this: it converts PDFs of scientific articles into XML. However, variability in text quality arising from the publishing process and the format of the articles results in incorrect XML that impedes accurate text mining. Additionally, we need to mine PDFs of different types of publications, including manuscripts, research letters, reports and articles, which may have radically different formats. We therefore built an ensemble tool that post-processes PDFX XML by combining multiple input sources, namely OCR text, PDFBox and the Entrez E-utilities API provided by PubMed, to improve the quality of the XML for journal articles. We were able to significantly improve the quality of the XML with respect to the fidelity of non-alphanumeric characters, the segmentation of the title, abstract, references and acknowledgment, and the word order of the text, leading to a data set more suitable for text mining.
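As a rough illustration of the post-processing idea, the sketch below overwrites the title and abstract of an extracted XML document with text taken from a more trusted source (for example, metadata retrieved from PubMed). It is not the authors' tool: the function name `patch_front_matter`, the element names `article-title` and `abstract`, and the sample strings are all assumptions made for this example, and a real PDFX file may use a different layout.

```python
import xml.etree.ElementTree as ET


def patch_front_matter(xml_text, trusted):
    """Overwrite selected elements of an extracted XML with trusted text.

    `trusted` maps a (hypothetical) element name to replacement text.
    """
    root = ET.fromstring(xml_text)
    for tag, value in trusted.items():
        node = root.find(f".//{tag}")
        if node is None or not value:
            # Nothing to patch, or no trusted text: keep the extracted version.
            continue
        # Drop child elements so stray inline markup does not survive.
        for child in list(node):
            node.remove(child)
        node.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up extraction output with typical damage (split words, OCR-like errors).
extracted = (
    "<article><front>"
    "<article-title>Prepar ing PDF Sci entific Articles</article-title>"
    "<abstract>PDF files are n0t suitable for text mining</abstract>"
    "</front><body>...</body></article>"
)

# Stand-in for metadata fetched from a trusted source.
trusted = {
    "article-title": "Preparing PDF Scientific Articles for Biomedical Text Mining",
    "abstract": "PDF files are not suitable for text mining.",
}

patched = patch_front_matter(extracted, trusted)
print(patched)
```

Only the elements named in `trusted` are touched; the body is passed through unchanged, so this kind of patch can be combined with separate corrections drawn from other sources.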