UC San Diego
Learning from the Catalog of GWAS to Extract Population Characteristics
- Author(s): Tumkur, Kashyap Ravi
- et al.
The Genome-Wide Association Study (GWAS) Catalog is a manually curated, literature-derived collection of all GWAS. This thesis describes a general approach that uses this curated data as training examples to extract the characteristics of population samples in GWAS: the experimental stage, the ethnicity groups of the individuals in the populations involved, and the numeric sizes of the sample population pools. Because curated data lack text-level annotations, using them in machine learning for natural language processing is challenging. We therefore formulate the problem as cost-sensitive learning from noisy labels, where the cost is estimated by a committee that considers both the curated data and the text. We evaluate this approach on two distinct problems: extracting sample characteristics as relations of the form (stage, ethnicity) and of the form (stage, ethnicity, size). We obtain macro F1 scores greater than 0.8 and 0.7 for the two tasks, respectively, outperforming similar but cost-insensitive techniques.
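
For readers unfamiliar with the reported metric, macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below is a minimal illustration of that standard definition, not the evaluation code used in the thesis; the function name `macro_f1` and the stage labels in the usage example are hypothetical.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over every label seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical labels, for illustration only
gold = ["initial", "initial", "replication", "replication"]
pred = ["initial", "replication", "replication", "replication"]
score = macro_f1(gold, pred)
```

In this toy example the per-class F1 scores are 2/3 and 4/5, so the macro average is their simple mean, regardless of how many examples each class has.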