exKidneyBERT: a language model for kidney transplant pathology reports and the crucial role of extended vocabularies.
Published Web Location
https://doi.org/10.7717/peerj-cs.1888
Abstract
BACKGROUND: Pathology reports contain key information about the patient's diagnosis as well as important gross and microscopic findings. These information-rich clinical reports offer an invaluable resource for clinical studies, but data extraction and analysis from such unstructured texts are often manual and tedious. While neural information retrieval systems (typically implemented as deep learning methods for natural language processing) are automatic and flexible, they usually require a large domain-specific text corpus for training, making them infeasible for many medical subdomains. Thus, an automated data extraction method for pathology reports that does not require a large training corpus would be of significant value and utility.

OBJECTIVE: To develop a language model-based neural information retrieval system that can be trained on small datasets, and to validate it by training it on renal transplant pathology reports to extract relevant information for two predefined questions: (1) What kind of rejection does the patient show? (2) What is the grade of interstitial fibrosis and tubular atrophy (IFTA)?

METHODS: Kidney BERT was developed by pre-training Clinical BERT on 3.4K renal transplant pathology reports containing 1.5M words. Then, exKidneyBERT was developed by extending Clinical BERT's tokenizer with six technical keywords, which extended the model's vocabulary, and repeating the pre-training procedure. All three models were fine-tuned with information retrieval heads.

RESULTS: The model with the extended vocabulary, exKidneyBERT, outperformed Clinical BERT and Kidney BERT on both questions. For rejection, exKidneyBERT achieved an 83.3% overlap ratio for antibody-mediated rejection (ABMR) and 79.2% for T-cell mediated rejection (TCMR). For IFTA, exKidneyBERT had a 95.8% exact match rate.

CONCLUSION: ExKidneyBERT is a high-performing model for extracting information from renal pathology reports. Additional pre-training of BERT language models on specialized small domains does not necessarily improve performance. Extending the BERT tokenizer's vocabulary is essential for specialized domains to improve performance, especially when pre-training on small corpora.
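The vocabulary-extension step described in METHODS can be illustrated with the Hugging Face Transformers library. The sketch below is an assumed workflow, not the authors' code: the checkpoint name and the six keyword strings are illustrative placeholders (the abstract does not list the actual terms or the exact Clinical BERT weights used), and the subsequent domain pre-training and question-answering fine-tuning are omitted.

    # Minimal sketch (assumed workflow): extend a Clinical BERT tokenizer with
    # domain keywords and resize the embedding matrix before further pre-training.
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Assumed checkpoint; the abstract does not name the exact Clinical BERT weights.
    checkpoint = "emilyalsentzer/Bio_ClinicalBERT"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForMaskedLM.from_pretrained(checkpoint)

    # Illustrative placeholders only; the paper adds six renal-pathology keywords,
    # but the abstract does not list them.
    new_terms = ["abmr", "tcmr", "ifta", "glomerulitis", "tubulitis", "arteritis"]
    num_added = tokenizer.add_tokens(new_terms)

    # Each new token needs an embedding row; these rows are then learned during
    # the repeated masked-language-model pre-training on the renal report corpus.
    model.resize_token_embeddings(len(tokenizer))
    print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")

The motivation for this step, as the conclusion notes, is that domain keywords absent from the base vocabulary would otherwise be split into several subword pieces; giving them dedicated tokens and embeddings lets the model represent them directly, which matters most when the pre-training corpus is small.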