Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Relation Extraction using Convolution Neural Networks for curation of GWAS catalog

  • Author(s): Goyal, Ankit
  • Advisor(s): Hsu, Chun-Nan
  • et al.

A crucial area of Natural Language Processing is information extraction, the identification and extraction of concepts of interest ("genes", "diseases", etc.) from text. This thesis proposes algorithms that extract relational information from biomedical text using machine learning techniques. In particular, the work presented here is concerned with identifying entity mentions in a given text that exhibit a semantic relationship with one another, and with extracting these entities for the curation of biomedical databases. One such database is the Genome-Wide Association Study (GWAS) catalog, a manually curated, literature-derived collection of all GWAS, which is the focus of our work.

This work presents a machine learning approach to natural language processing that automatically extracts GWAS catalog information from new biomedical text. We focus on characteristics of the population samples used in the experiments, i.e., the experimental stage, the ethnicity groups of the individuals, and the size of the population pool. Our approach to relation extraction is based on convolutional neural networks with different filter sizes, trained on already curated data from existing biomedical databases. Although these neural networks have previously been used for relation extraction and other natural language processing tasks, to the best of our knowledge they have never been applied to the problem of automatic data curation, and we focus primarily on developing a learning framework for this problem specifically. We evaluated our approach by extracting the sample characteristics as tuple relations; our neural network models outperformed a baseline approach previously developed for the same task.
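As a rough illustration of what "convolutional neural networks with different filter sizes" can look like for sentence-level relation classification, the sketch below runs a forward pass of a generic multi-width text CNN: each filter slides over the token embeddings, the strongest response per filter is kept (max-over-time pooling), and the pooled features feed a softmax layer. This is not the thesis's actual architecture, which the abstract does not specify. The filter widths, embedding size, label set, example sentence, and all function names are assumptions chosen for illustration, the weights are random rather than trained, and only the Python standard library is used so the example stays self-contained.

```python
import math
import random


def conv_max_pool(embeddings, filt, bias):
    """Slide one filter over the token sequence, apply ReLU, max-pool over time."""
    width = len(filt)  # number of consecutive tokens the filter covers
    best = 0.0         # ReLU output is never below 0, so 0 is a safe starting maximum
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        activation = bias + sum(
            w * x
            for filt_row, token_vec in zip(filt, window)
            for w, x in zip(filt_row, token_vec)
        )
        best = max(best, activation)
    return best


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cnn_forward(embeddings, filter_bank, out_weights, out_bias):
    """One pooled feature per filter, then a linear layer and softmax."""
    features = [conv_max_pool(embeddings, filt, bias) for filt, bias in filter_bank]
    scores = [
        b + sum(w * f for w, f in zip(row, features))
        for row, b in zip(out_weights, out_bias)
    ]
    return softmax(scores)


def random_matrix(rng, rows, cols):
    """Random weights stand in for trained parameters in this sketch."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# All values below are hypothetical, chosen only to make the sketch runnable.
rng = random.Random(0)
EMBED_DIM = 8
FILTER_WIDTHS = (2, 3, 4)            # assumed filter sizes
FILTERS_PER_WIDTH = 4
LABELS = ("no_relation", "related")  # assumed label set

filter_bank = [
    (random_matrix(rng, width, EMBED_DIM), 0.0)
    for width in FILTER_WIDTHS
    for _ in range(FILTERS_PER_WIDTH)
]
out_weights = random_matrix(rng, len(LABELS), len(filter_bank))
out_bias = [0.0] * len(LABELS)

# A made-up sentence; random vectors stand in for learned word embeddings.
sentence = "2,000 European ancestry cases in the replication stage".split()
embeddings = random_matrix(rng, len(sentence), EMBED_DIM)

probs = cnn_forward(embeddings, filter_bank, out_weights, out_bias)
print(dict(zip(LABELS, probs)))
```

Using several filter widths lets the model respond to n-gram patterns of different lengths around the candidate entities. A trained version would learn the filters and output weights from the already curated catalog entries, and would normally be written in a deep learning framework rather than plain Python.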
