There is a need to establish a new publishing paradigm to cope with the deluge of data artifacts produced by data-intensive science, many of which are vital to data re-use and verification of published scientific conclusions. Due to the limitations of traditional publishing, most of these artifacts are not usually disseminated, cited, or preserved. These latent artifacts consist largely of datasets and data processing information that together form the foundations of the reasoned analyses that appear in the published literature. But this traditional record of science increasingly represents only the tip of the scientific iceberg.
One promising approach to this problem of data invisibility is to wrap these artifacts in the metaphor of a “data paper”, a somewhat unfamiliar bundle of scholarly output with a familiar facade. As envisioned, a data paper minimally consists of a cover sheet and a set of links to archived artifacts. The cover sheet contains familiar elements such as title, authors, date, abstract, and persistent identifier (e.g., a DOI or ARK). These elements are just enough to expose the data to internet search engines for basic discovery, to build a basic data citation, to instill confidence in the identifier’s stability, and to be picked up by indexing services such as Google Scholar.
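To make the minimal structure concrete, the following is an illustrative sketch only, not a proposed schema. The class names, field names, citation layout, and the sample identifier and URL are all assumptions introduced here; the only elements taken from the description above are the cover sheet fields (title, authors, date, abstract, persistent identifier) and the set of links to archived artifacts.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoverSheet:
    """Familiar descriptive elements of a data paper (hypothetical field names)."""
    title: str
    authors: List[str]
    date: str        # publication date, kept as free text in this sketch
    abstract: str
    identifier: str  # persistent identifier, e.g. a DOI or ARK


@dataclass
class DataPaper:
    """A cover sheet plus links to archived artifacts."""
    cover: CoverSheet
    artifact_links: List[str] = field(default_factory=list)

    def citation(self) -> str:
        # One possible citation layout; real citation styles vary by discipline.
        authors = "; ".join(self.cover.authors)
        return f"{authors} ({self.cover.date}). {self.cover.title}. {self.cover.identifier}"


# Hypothetical example values, used only to show the shape of the record.
paper = DataPaper(
    cover=CoverSheet(
        title="Example Stream Temperature Readings",
        authors=["Doe, J.", "Roe, R."],
        date="2011",
        abstract="Hypothetical dataset used only to illustrate the structure.",
        identifier="doi:10.0000/example",
    ),
    artifact_links=["https://archive.example.org/datasets/123"],
)

print(paper.citation())
```

The same few fields serve both discovery (they can be indexed) and citation (they can be formatted into a reference), which is the point of keeping the cover sheet minimal.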
This simple format represents only the first stage of the evolution of the data paper. There is room for the format to increase in complexity with the incorporation of other valuable elements, both general-purpose and discipline-specific, to enrich discovery, re-use, and archiving. An exciting potential outcome of this development of the data paper as publication is the parallel emergence of a new kind of “data journal”. Like regular journals, data journals would spring up around disciplines and sub-disciplines as needed, and we could expect that some of them would also be peer-reviewed. The data journal is envisioned as an “overlay” journal: an editor would assemble an issue by selecting data papers from any number of sources and archives, and combining them with front matter, a table of contents, editorial policies, submission guidelines, etc.
This new data publishing paradigm promises to strengthen the scientific community practices of data sharing, re-use, and preservation. Scientists want to do science, get credit for it, communicate about it with their peers, and improve the measurable outputs by which their funders and employers evaluate their performance. The elements of the data paper create a recognizable and standardized form for previously unpublished data artifacts, making them easier to approach, evaluate, and automatically index for basic discovery purposes. Those same elements can easily be repurposed to create familiar-looking citations suitable for reference in CVs and all manner of publication. Finally, unique persistent identifiers for data papers and data artifacts greatly facilitate automatic discovery of a data paper’s impact and re-use.