eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Previously Published Works

Hunting Data Rogues at Scale: Data Quality Control for Observational Data in Research Infrastructures

Abstract

Data quality control is one of the most time-consuming activities within Research Infrastructures (RIs), especially when it involves observational data and multiple data providers. In this work we report on our ongoing development of data rogues, a scalable approach to managing data quality issues for observational data within RIs. The motivation for this work came from the creation of the FLUXNET2015 dataset, which includes carbon, water, and energy fluxes plus micrometeorological and ancillary data measured at over 200 sites around the world. Creating a uniform dataset, including derived data products, required extensive work on data quality control. The unpredictable nature of observational data quality issues makes automating data quality control inherently difficult. Developed from this experience, the data rogues methodology allows for increased automation of quality control activities by systematically identifying, cataloging, and documenting implementations of solutions to data issues. We believe this methodology can be extended and applied to other domains and types of data, making the automation of data quality control a more tractable problem.
