ScienceSearch: Enabling Search through Automatic Metadata Generation
Published Web Location
https://doi.org/10.1109/escience.2018.00025
Abstract
Scientific facilities are increasingly generating and handling large amounts of data from experiments and simulations. Next-generation scientific discoveries rely on insights derived from data, especially across domain boundaries. Search capabilities are critical to enable scientists to discover datasets of interest. However, scientific datasets often lack the signals or metadata required for effective searches. Thus, we need formalized methods and systems to automatically annotate scientific datasets from the data and its surrounding context. Additionally, a search infrastructure needs to account for the scale and rate of application data volumes. In this paper, we present ScienceSearch, a system infrastructure that uses machine learning techniques to capture and learn the knowledge, context, and surrounding artifacts from data to generate metadata and enable search. Our current implementation is focused on a dataset from the National Center for Electron Microscopy (NCEM), an electron microscopy facility at Lawrence Berkeley National Laboratory, sponsored by the Department of Energy, that supports hundreds of users and stores millions of micrographs. We describe a) our search infrastructure and model, b) methods for generating metadata using machine learning techniques, and c) optimizations to improve search latency, together with deployment on an HPC system. We demonstrate that ScienceSearch is capable of producing valid metadata for NCEM's dataset and providing low-latency, good-quality search results over a scientific dataset.