Information retrieval techniques have been applied to biomedical research for a variety of purposes, such as textual document retrieval and molecular data retrieval. As biomedical research evolves, information retrieval constantly faces new challenges, including the growing volume of available data, emerging data types, the demand for interoperability between data resources, and changes in users’ search behavior. To help address these challenges, I studied three solutions in my dissertation: (a) using information collected from online resources to enrich the representation models for biomedical datasets; (b) exploring rule-based and deep learning-based methods to help users formulate effective queries for both dataset retrieval and publication retrieval; and (c) developing a “retrieval plus re-ranking” strategy to identify relevant datasets and rank them using customized ranking models.
In a biomedical dataset retrieval study, we developed a pipeline to automatically analyze users’ free-text requests and rank relevant datasets using a “retrieval plus re-ranking” strategy. To improve the representation model of biomedical datasets, we explored online resources and collected information to enrich the datasets’ metadata. The rule-based query formulation module extracted keywords from users’ free-text requests, expanded the keywords using NCBI resources, and formulated Boolean queries using pre-designed templates. The novel “retrieval plus re-ranking” strategy captured relevant datasets in the retrieval step, then ranked them using customized relevance scoring functions that model unique properties of biomedical dataset metadata. These solutions proved successful for biomedical dataset retrieval: the pipeline achieved the highest inferred Normalized Discounted Cumulative Gain (infNDCG) score in the 2016 bioCADDIE Biomedical Dataset Retrieval Challenge.
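As a rough illustration of the steps described above, and not the pipeline's actual implementation, the following sketch chains keyword extraction, keyword expansion, template-based Boolean query formulation, retrieval, and re-ranking. Everything specific in it is an assumption made for the example: the synonym table stands in for expansion via NCBI resources, and the stopword list, field weights, scoring rule, and sample records are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical synonym table; a stand-in for expansion via NCBI resources.
SYNONYMS: Dict[str, List[str]] = {
    "cancer": ["neoplasm", "tumor"],
    "mouse": ["mice", "mus musculus"],
}
# Hypothetical stopword list for the toy keyword extractor.
STOPWORDS = {"find", "data", "on", "in", "of", "the", "for", "and", "a"}
# Hypothetical per-field weights used by the toy re-ranking score.
FIELD_WEIGHTS: Dict[str, float] = {"title": 2.0, "description": 1.0}

# Made-up dataset metadata records.
DATASETS = [
    {"id": "D1", "title": "Tumor growth in mice", "description": "RNA-seq profiles"},
    {"id": "D2", "title": "Expression atlas", "description": "Samples from mouse liver"},
    {"id": "D3", "title": "Yeast proteome", "description": "Mass spectrometry"},
]


def extract_keywords(request: str) -> List[str]:
    """Keep unique, lowercased tokens that are not stopwords."""
    keywords: List[str] = []
    for token in request.split():
        token = token.strip(".,;:?!").lower()
        if token and token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def expand(keywords: List[str]) -> List[List[str]]:
    """Turn each keyword into a group: the keyword plus its synonyms."""
    return [[kw] + SYNONYMS.get(kw, []) for kw in keywords]


def formulate_query(groups: List[List[str]]) -> str:
    """Template: OR within a synonym group, AND across groups."""
    return " AND ".join(
        "(" + " OR ".join(f'"{term}"' for term in group) + ")" for group in groups
    )


def retrieve(datasets: List[dict], groups: List[List[str]]) -> List[dict]:
    """Retrieval step: keep datasets whose metadata mentions any term."""
    terms = [term for group in groups for term in group]
    return [
        d for d in datasets
        if any(term in d[field].lower() for term in terms for field in FIELD_WEIGHTS)
    ]


def score(dataset: dict, groups: List[List[str]]) -> float:
    """Toy relevance score: weighted count of groups matched per field."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        text = dataset[field].lower()
        total += weight * sum(any(term in text for term in group) for group in groups)
    return total


def search(request: str, datasets: List[dict]) -> Tuple[str, List[str]]:
    """Run the whole sketch: formulate the query, retrieve, then re-rank."""
    groups = expand(extract_keywords(request))
    candidates = retrieve(datasets, groups)
    ranked = sorted(candidates, key=lambda d: score(d, groups), reverse=True)
    return formulate_query(groups), [d["id"] for d in ranked]


query, ranked_ids = search("Find data on cancer in mouse", DATASETS)
print(query)
print(ranked_ids)
```

The split into a permissive retrieval step and a separate scoring step mirrors the “retrieval plus re-ranking” idea: the first step favors recall, while the second can weight metadata fields differently. The substring matching used here is deliberately naive and would need proper tokenization in practice.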
In a biomedical publication retrieval study, we developed the eXtended PubMed Related Citation (XPRC) algorithm to find similar articles in PubMed. Currently, similar articles in PubMed are determined by the PubMed Related Citation (PRC) algorithm. However, when the distributions of term counts are similar between articles, the PRC algorithm may conclude that the articles are similar even though they are about different topics. Conversely, when two articles discuss the same topic but use different terms, the PRC algorithm may miss the similarity. To address this problem, we implemented a term expansion method to help capture such similarity. Unlike popular ontology-based expansion methods, we used a deep learning method to learn distributed representations of terms from over one million articles in PubMed Central, and identified similar terms using the Euclidean distance between their representation vectors. We showed that, under certain conditions, XPRC can improve precision and help find similar articles in PubMed.
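The term expansion idea can be illustrated with a minimal sketch, assuming each term already has a learned vector. The three-dimensional vectors below are invented for the example (real distributed representations are learned from text and have far more dimensions), and `nearest_terms` is a hypothetical helper, not part of XPRC itself.

```python
import math
from typing import Dict, List, Tuple

# Toy vectors with made-up values; stand-ins for learned term representations.
EMBEDDINGS: Dict[str, Tuple[float, ...]] = {
    "tumor": (0.90, 0.10, 0.00),
    "neoplasm": (0.85, 0.15, 0.05),
    "carcinoma": (0.80, 0.20, 0.10),
    "liver": (0.10, 0.90, 0.20),
    "kinase": (0.00, 0.20, 0.95),
}


def nearest_terms(
    term: str, embeddings: Dict[str, Tuple[float, ...]], k: int = 2
) -> List[str]:
    """Return the k terms closest to `term` by Euclidean distance."""
    target = embeddings[term]
    # math.dist computes the Euclidean distance between two points (Python 3.8+).
    distances = [
        (math.dist(target, vector), other)
        for other, vector in embeddings.items()
        if other != term
    ]
    distances.sort()  # smallest distance first
    return [other for _, other in distances[:k]]


print(nearest_terms("tumor", EMBEDDINGS, k=2))
```

With these toy vectors, "neoplasm" and "carcinoma" lie closest to "tumor", so an article using either term could be matched against one that only mentions "tumor".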
In conclusion, information retrieval techniques in biomedical research have helped researchers find desired publications, datasets, and other information. Further research on developing robust representation models, intelligent query formulation systems, and effective ranking models will lead to smarter and more user-friendly information retrieval systems that will further promote the transformation from data to knowledge in biomedicine.