An Open Data Framework for the San Francisco Estuary

Author(s): Baerwald, Melinda R.; Davis, Brittany E.; Lesmeister, Sarah; Mahardja, Brian; Pisor, Rachel; Rinde, Jenna; Schreier, Brian; Tobias, Vanessa


INTRODUCTION
Transparency and reproducibility are cornerstones of the scientific method (Nosek et al. 2015), yet, in practice, scientific journal articles with readily accessible data and fully reproducible methods are uncommon (Hampton et al. 2013; Kidwell et al. 2016). To improve the dissemination of scientific information and data, there has been a push to change the way scientific information is shared (Reichman et al. 2011). The objective of this push, referred to as the "open science" movement, is to encourage transparency throughout all stages of the scientific research process, including data availability; publication access; and detailed, transparent, and reproducible methods. The implementation of "open science" practices improves scientific rigor, encourages collaboration, and accelerates discoveries and innovations overall (Woelfle et al. 2011). Science is built on data, and accordingly, open data is a key aspect of the open science movement. Open data means that data are freely available to the public and are described by sufficient documentation for appropriate reuse.
The benefits of open science and open data are not limited to altruistic improvements for the broad scientific community. Individual researchers and teams also directly benefit from using open practices. Open scientific publications have been shown to receive more citations and media coverage, and are associated with more job and funding opportunities (McKiernan et al. 2016). When data are reused regularly, as in the case of mandated monitoring programs, making data publicly available with supporting information can streamline data delivery (i.e., there is no need to wait for the data manager to send the data set) and reduce costs. Open access to data also helps data producers gain trust from both the scientific community and the public through transparency and reproducibility. Partly in response to a history of limited openness, the scientific community now more frequently includes open data requirements for peer-reviewed scientific journal publications, grant applications, and governmental entities. The purpose of this article is to highlight the benefits of open science and to share tools for scientists within the San Francisco Estuary (estuary) region to help streamline the process. We are not attempting to provide an overview of the open science movement in the field of ecology, as others have already written recent thorough reviews and perspectives on this topic (Molloy 2011; Hampton et al. 2015; Culina et al. 2018; Powers and Hampton 2019). Although many individuals and organizations collect data of various types and sizes, we highlight the estuary in particular because of the quantity of data that exists. Our own efforts to make environmental data from long-term monitoring programs and research studies in the estuary more accessible brought to light the challenges that many researchers face on their path to open data.
To help future data providers make their data fully open, we used our collective experience to develop a set of steps for producing open data and briefly discuss why each step is important.

INTERAGENCY ECOLOGICAL PROGRAM DATA UTILIZATION WORK GROUP
One of the primary groups involved with the generation of long-term data sets of the estuary's natural resources is the Interagency Ecological Program (IEP). The IEP is a consortium of state and federal agencies that has been conducting collaborative ecological studies in the estuary since the 1970s to help inform the potential environmental effects of state and federal pumping operations that divert water from the estuary to the southern part of California. Information gathered by the IEP has been used many times to guide management actions whose goals are to help the recovery of endangered species that reside within the estuary. The IEP currently has over 30 long-term monitoring programs conducted by government agencies and universities, some of which started decades ago between the 1950s and the 1990s. This combined effort amounts to more than 500,000 independent sampling events over the years. Numerous studies that have played a key role in improving our scientific understanding of the estuary relied on these data sets, and information from these efforts has led to follow-up studies that range in topic from genetics to physiology to stable isotopes (Sommer et al. 2019).
Given the large spatial and temporal scale of the data collected by IEP, the range of topics covered, and the increasing demand for information, the IEP established a work group focused on data management in the late 1990s. Dubbed the Data Utilization Work Group (DUWG), this group's goal is to ensure that all data and information generated by IEP are of high quality and are consistent with management and science priorities. It does so by setting internal procedures and guidelines, defining and implementing shared data standards across member agencies, facilitating data sharing in a timely manner, and coordinating with other data management teams in the local science community. The DUWG's recent effort to set guidelines and coordinate open data protocols has helped with preparing for implementation of the Open and Transparent Water Data Act (AB 1755, https://water.ca.gov/Programs/All-Programs/AB-1755), enacted in 2016, which requires that all water data collected by state government agencies be published in an open data repository.
Data from monitoring programs provide information that ecosystem managers need to determine the status of a species and decide on management actions. However, for data analysts who turn monitoring data into information, locating such data can sometimes be an arduous task that involves finding the contact information of the data manager, formally requesting the data from the individual, and waiting for a reply. Even after successfully gathering data, those hoping to conduct analysis sometimes face the additional challenge of having little to no metadata to give context to the data. From the perspective of the data producer, concerns may arise about data requests, because data users who have little understanding of the design or nuances of the study or monitoring program may inadvertently misuse data. There is also concern that data requesters may use the data selectively to support personal or institutional goals, or use the data without crediting the agency that collected them. To address these common concerns, the IEP DUWG has developed an open data framework to facilitate reuse and synthesis of IEP data and to aid other organizations or individuals that are interested in data sharing. This open data framework follows four simple steps:
1. Write a Data Management Plan (DMP) to ensure the future integrity of the data.
2. Develop a Quality Assurance/Quality Control (QA/QC) procedure to maintain the high quality of information.
3. Write a metadata document for data users and new members joining the study or monitoring program.
4. Publish the data set to allow replication of studies and to provide credit for the data producers.

Figure 1 Steps in the open data framework (panel labels: Data Management Plans; Quality Assurance and Control; Metadata; Data Publishing; Update and Repeat)
VOLUME 18, ISSUE 2, ARTICLE 1

Data Management Plans
Data Management Plans (DMPs) are documents created to provide essential information on how data are managed from raw collection to a final archived format. Formal DMPs are increasingly required for federal and private competitive grants (Thoegersen 2015). DMPs are typically brief documents (a couple of pages) that describe the data life-cycle: data collection and storage, QA/QC procedures, where the data will be archived, and steps for sharing (Michener 2015).
DMPs have five general components: program description, data description, data preservation, QA/QC, and data-sharing requirements.
(1) The program description should contain data set name, points of contact, and types of data collected or their data sources.
(2) The DMP's data description component briefly describes the metadata, any metadata standards used, and where data users can find the metadata document.
(3) Data preservation documents the data format (data type and file extension) and how they are stored, backed up, and archived. Data storage and archiving are similar and often used interchangeably, but, for data-management purposes, are different. Data storage typically refers to procedures for storing data in the original format in which the data were gathered or entered. Data archiving refers to procedures for the final, QA/QC-ed data for long-term sharing and preservation. Both data storage and data archiving should include back-up copies in two to three locations, such as an agency's server and a cloud network.
(4) The QA/QC components briefly describe the standards or procedures used to qualify the data, and where additional documents such as standard operating procedures or a QA Program Plan can be located.
(5) The data-sharing requirements section describes how and where the public can access the data, as well as how the data should be cited and any requirements for using the data.
DMPs provide many benefits as a planning and communication tool (Jones 2011). For planning, DMPs ensure transparency between data producers and consumers by outlining the steps of how the data will be generated and QA/QC-ed, and where it will be archived. As a communication tool, DMPs can help with onboarding new staff and avoiding unnecessary duplication, and, for long-term maintenance, promote strong metadata on how the data were processed and stored. DMPs can be used for short-term studies or long-term monitoring programs. They are also applicable to synthesis efforts to document the various data sources used in an effort, even when no original data are produced. Table 1 shows a DMP template developed by the IEP DUWG and more information about developing a DMP.
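As an illustration of the five general components listed above, the following is a minimal Python sketch of a DMP held as a structured record with a check for components that are still blank. The class name, field names, and example text are hypothetical and are not taken from the IEP DUWG template in Table 1.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class DataManagementPlan:
    """Hypothetical container for the five general DMP components."""
    program_description: str = ""  # data set name, contacts, data types or sources
    data_description: str = ""     # metadata summary, standards, where to find it
    data_preservation: str = ""    # formats, storage, back-up, archiving
    qa_qc: str = ""                # standards or procedures, where SOPs are located
    data_sharing: str = ""         # access, citation, and use requirements


def missing_components(plan: DataManagementPlan) -> List[str]:
    """Return the names of components that are still blank."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


# Made-up draft with only two of the five components filled in.
draft = DataManagementPlan(
    program_description="Example fish survey; contact: data manager; catch counts",
    data_preservation="CSV files; agency server plus cloud back-up",
)
print(missing_components(draft))
```

A check of this kind could be run before a plan is circulated, so that an incomplete draft is caught early; a real DMP would of course remain a written document.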

Quality Assurance and Control
Quality Assurance (QA) is a system of management activities that includes planning, implementing, assessing, reporting, and continuously improving (EPA 2002a). A component of QA is QC, which is the system of technical activities whose purpose is to quantify data errors and/or level of uncertainty, and to determine the effect of those errors (EPA 2006). The foundation of QA is systematic planning, which directs efficient sample collection, early detection of errors, and better results and effective decision-making based on the research (EPA 2001a). Planning involves laying out the study objectives and identifying the data quality needed to meet those objectives, including appropriate data type, how the data will be used, performance criteria, QC activities, and how the data will be analyzed (EPA 2006). Quality Control activities systematically determine if the data are of the appropriate type, frequency, and quality needed to answer study questions and support resource-management decisions. This includes QC activities for each sampling, analysis, or measurement technique; for each method and procedure; for acceptance criteria; and for corrective actions (EPA 2001a; EPA 2002b).
Quality Assurance is an integral part of open data and involves every stage of a project's life-cycle. The US Environmental Protection Agency (EPA) has been at the forefront in developing quality management systems and directs local regulatory agencies such as the State Water Resources Control Board to follow those models; therefore, our recommendations are built on EPA guidance. Further, when the robust QA practices outlined in EPA guidance are supported at the institution through a Quality Management System (EPA 2001b), quality is better integrated into its culture. Key components of a quality management system include a QA policy, quality system documentation, annual assessments, training and education, systematic planning of projects, project-specific quality documentation, and project and data assessments (EPA 2001b). There are also tools that can be implemented on a project level to ensure data of known and documented quality. These include study plans and/or quality assurance project plans, training plans, and data assessments (EPA 2001a). The necessary QC activities may be identified during the planning process and may include documents such as checklists, instrument-calibration sheets, field sheets, laboratory chain-of-custody forms and reports, and/or codes or scripts used in data-validation activities. All these QC activities, tools, and components are interwoven into each step of the data life-cycle, and are applicable to anyone involved with the data, from data collector to lab technician to data scientist. Table 1 shows additional resources for how to get started.
Having proper QA/QC processes allows data producers to communicate the quality, limitations, and appropriate use of the data. With this transparency, data users may be more coordinated in their methods of checking data quality, because they will know the methods originally used to check the data. Having this information at hand may significantly reduce the time spent in performing the data-cleansing activities required for data reuse (Vetrò et al. 2016).
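One of the QC tools mentioned above is a script used in data validation. The following is a minimal Python sketch of such a script, assuming that acceptance criteria can be written as a numeric range per parameter. The column names, ranges, and records are invented for illustration and do not come from any IEP program or EPA document.

```python
import csv
import io

# Hypothetical acceptance criteria: parameter -> (minimum, maximum).
ACCEPTANCE_RANGES = {
    "water_temp_c": (0.0, 35.0),
    "ph": (0.0, 14.0),
}


def flag_records(rows, ranges=ACCEPTANCE_RANGES):
    """Yield each row with an added 'qc_flag' column.

    A row is flagged 'missing' if a checked value is blank or not numeric,
    'out_of_range' if it falls outside the acceptance range, else 'pass'.
    """
    for row in rows:
        flag = "pass"
        for name, (low, high) in ranges.items():
            raw = (row.get(name) or "").strip()
            try:
                value = float(raw)
            except ValueError:
                flag = "missing"
                break
            if not low <= value <= high:
                flag = "out_of_range"
                break
        yield {**row, "qc_flag": flag}


# Made-up example records, not real monitoring data.
example_csv = """station,water_temp_c,ph
A1,18.2,7.9
A2,,8.1
A3,52.0,7.5
"""

flagged = list(flag_records(csv.DictReader(io.StringIO(example_csv))))
for record in flagged:
    print(record["station"], record["qc_flag"])
```

Flagging records rather than deleting them keeps the original values intact, so data users can see which checks were applied and decide for themselves how to treat flagged values.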

Metadata
Metadata is information about data that describes the 'who, what, where, when, why, and how' of data sets. Metadata is a critical component of a data set because it enables users to conduct the most accurate data analysis and reduces the likelihood of misuse. There are several types of metadata (Zeng 2015; Riley 2017), ranging from (1) information that describes how the data were collected and produced (process metadata); to (2) the actual content of the data, parameter ranges, and quality of the data (reference metadata); to (3) more general information such as the title, a set of key words, an abstract, and contacts that help better identify the data and make it discoverable (descriptive metadata); to, finally, (4) how the data are archived over the long term, permissions, and property rights (administrative metadata). In addition to the various types of metadata, different formats of metadata (also called schemas) exist. While numerous scientists still generate metadata using a text document format (e.g., Microsoft Word), many agencies are requiring the use of standardized science or geospatial metadata formats such as Ecological Metadata Language (EML) (Jones et al. 2019) and Federal Geographic Data Committee (FGDC) schemas. Standardized metadata formats allow for greater readability and interoperability amongst diverse data sets by both scientists and machines (Duval et al. 2002). Both EML and FGDC are used by state and federal agencies within the IEP, and applications are available for interoperability (i.e., cross-walking) between formats. Table 1 shows a complete list of available schemas. Using standardized schemas, IEP data and metadata can be harvested (machine-read) and accessed on a variety of web-based data platforms. Standardized and accessible metadata increases discoverability, contributing to big-data synthesis projects, meeting state AB 1755 protocols, and helping decrease data misuse.
With participation from state, federal, and program partners, the IEP DUWG developed recommendations for best metadata practices. Recommended practices include using standardized EML with inclusion of a robust methodology section that better describes field and data collection practices for water quality and biological data for long-term monitoring programs. Detailed metadata for long-term data sets that span decades are particularly important to capture changes in methodologies over time, which may affect analysis and integration of similar data sets used in synthesis efforts. In addition to recommendations for metadata, the IEP DUWG provides resources for creating standardized metadata. These resources include metadata Microsoft Word templates, as well as a number of important metadata elements or types (descriptive, process, reference, and administrative); R scripts for creating standardized machine-readable EML metadata (in .xml formats); and instructions for how to generate and publish usable metadata (Table 1).
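To illustrate what machine-readable metadata means in practice, the following Python sketch groups a few entries under the four metadata types and serializes them to XML, then reads one entry back programmatically. This is an illustration only: the element names and example values are hypothetical, the output does not follow the EML or FGDC schemas, and it is not a substitute for the R scripts listed in Table 1.

```python
import xml.etree.ElementTree as ET

# Made-up example content grouped by the four metadata types named above.
metadata = {
    "descriptive": {"title": "Example water quality survey",
                    "keywords": "estuary; water quality"},
    "process": {"methods": "Describe field and data collection practices here."},
    "reference": {"parameter": "water_temp_c", "units": "degrees Celsius"},
    "administrative": {"archive": "Name of repository", "rights": "Usage terms"},
}


def to_xml(record):
    """Serialize the nested dictionary to an XML string (not valid EML)."""
    root = ET.Element("metadata")
    for section, entries in record.items():
        node = ET.SubElement(root, section)
        for key, value in entries.items():
            ET.SubElement(node, key).text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(metadata)
print(xml_text)

# Round trip: a program can read the title back without human help.
parsed = ET.fromstring(xml_text)
print(parsed.findtext("descriptive/title"))
```

The round trip at the end is the point of the sketch: once metadata is stored in a predictable structure, a data platform can harvest fields such as the title or key words automatically, which a free-text document does not allow.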

Data Publishing
Data publishing is, essentially, making data sets available for reuse by others. There are several best practices to consider for data publication. One is that a data set and its associated metadata are considered equally valuable and create a complete data package only when combined. The data package can be made public via a public website, a data repository, a journal focused on data publication, or as a supplement to a research article. The latter three allow the data package to obtain a Digital Object Identifier (DOI), which is a universally recognized alphanumeric sequence assigned by a registration agency (e.g., DataCite or Crossref) to provide a persistent link to the data package's location on the internet. Having a DOI assigned to a data package facilitates data discovery, usage tracking, documentation of data set versions, and retention of the relationship between a data set and its metadata. It is also recommended that machine-readable, stable, non-proprietary, and standardized data formats be used (e.g., .csv, .txt). Promoting discovery and reuse of a high-quality data package can reap later rewards through increased exposure and citations of your scientific research. Some possibilities include posting the data package or a hyperlink to it on data portals; including a hyperlink on a professional website; requesting that it be a featured contribution in a data repository; and calling attention to it on your preferred professional social media site (e.g., Twitter).
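The idea of a data package, a data set in a non-proprietary format combined with its metadata, can be sketched in a few lines of Python. The file names, example rows, and the checksum manifest below are assumptions made for illustration; they are not requirements of any particular repository, and DOI assignment happens at the repository or registration agency rather than in a script like this.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def write_package(folder, rows, metadata_text):
    """Write data.csv, metadata.txt, and a checksum manifest into folder."""
    folder = Path(folder)
    data_path = folder / "data.csv"
    meta_path = folder / "metadata.txt"

    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    meta_path.write_text(metadata_text, encoding="utf-8")

    # Record a SHA-256 checksum per file so users can verify their download.
    manifest = {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in (data_path, meta_path)
    }
    lines = [f"{digest}  {name}" for name, digest in manifest.items()]
    (folder / "manifest.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


# Made-up example rows, not real monitoring data.
rows = [{"station": "A1", "water_temp_c": "18.2"},
        {"station": "A2", "water_temp_c": "17.6"}]

with tempfile.TemporaryDirectory() as tmp:
    manifest = write_package(tmp, rows, "Title: Example survey\n")
    written = sorted(p.name for p in Path(tmp).iterdir())
    print(written)
```

Writing the data and metadata in the same step keeps the two from drifting apart, which reflects the recommendation that neither is complete without the other.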
Recently, IEP has made a concerted effort to publish and obtain DOIs for six of its monitoring programs and one discrete study (Table 2). This effort is ongoing, and in the next few years many more IEP data packages are anticipated to be added to IEP's data repository of choice (i.e., Environmental Data Initiative) or similar data repositories (Table 1).
Published data sets may range from short-term research studies to multi-decade monitoring programs. Regardless of the data type or scope, publishing is beneficial for data generators, data users, funders, decision-making managers, and the public. Data generators benefit by getting recognition for their scientific contribution outside of the typical peer-reviewed research article and can add the citation to their resume as a publication. Data generators and users both benefit from having data accessible to answer new questions and from becoming part of larger meta-analyses. Researchers participating in larger meta-analyses using accessible data can generate synthesis products that lead to broader discoveries (e.g., Stompe et al., this issue). Maximizing the data's utility through publication and subsequent meta-analyses can promote increased knowledge sharing and collaborative opportunities, as well as reduce scientific redundancy to make funding dollars go further. The overall increase in efficiency and opportunities to broaden scientific scope can enable the more rapid and informed decision-making needed for a successful adaptive-management framework. The public can also benefit from data publication because the practice promotes increased transparency and data accessibility, and, in conjunction with quality metadata, increases scientific credibility. Given that the benefits of data publishing are numerous, we recommend that more effort be directed towards incentivizing it. Agency executives, managers, and supervisors can strongly promote data publication by including it in job duty statements, by considering it a valued product during performance reviews, and by giving staff sufficient time to work on producing this type of publication.

OPEN DATA CHALLENGES
While the benefits of open data to researchers and the broader scientific community are substantial, scientists face roadblocks in making their data open.
Here, we list some of the most common barriers identified by IEP scientists, as well as some insight into how to move past these barriers.
a. Data misuse concerns can be ameliorated by ensuring that robust metadata is included with any published data, and by ensuring that whatever online repository is used enables users to download all metadata with the data set.
b. Data plagiarism concerns can be addressed by ensuring that published data contain clear instructions for how to cite the data, and that journals require authors to properly cite any third-party data in manuscripts.
c. Scooping concerns can be addressed by carefully timing publications of both manuscripts and data, as well as the use of preprints. Preprints are complete draft manuscripts that are stored in a repository before journal submission or publication. Preprints are publicly available, and each one receives a DOI, allowing them to be cited. Having preprints for your work ensures that your ideas, study design, and results remain yours. Any potential scooping of a study via open data would be minimized, because other researchers would be more likely to pursue study questions or methods with your data that are distinct from your original study.

STARTING (OR CONTINUING) YOUR OPEN SCIENCE JOURNEY
The path to open data and, ultimately, open science can be intimidating. Fortunately, plentiful resources are available (e.g.,