eScholarship
Open Access Publications from the University of California

These are the publications for The Center for Knowledge Infrastructures on eScholarship. We conduct research on scientific data practices and policy, scholarly communication, and socio-technical systems. 

Towards a Field Guide to the (In)visible Labor of Data-intensive science

(2019)

We apply the concept of invisible labor, as developed by labor scholars over the last forty years, to the field of data-intensive science. Drawing on a fifteen-year corpus of data from multiple scientific disciplines on data-intensive science, we develop both a field guide to the invisible work of data-intensive science and a simple observational protocol intended as an aid to researchers studying data-intensive science. We conceptualize data-intensive science as an evolving field and highlight parallels in the labor literature and Science and Technology Studies, noting where data-intensive science intersects and overlaps with broader trends in the 21st century economy. In closing, we look towards changes in scientific labor on the near horizon, discussing how artificial intelligence and machine learning have begun to alter labor in industry. We also speculate on how the new technology will alter scientific labor and argue for the need to continually document and make visible the evolving forms of scientific organization and scientific labor from the ground up.

Once FITS, Always FITS? Astronomical Infrastructure in Transition

(2019)

The FITS file format has become the de facto standard for sharing, analyzing, and archiving astronomy data over the last four decades. FITS was adopted by astronomers in the early 1980s to overcome incompatibilities between operating systems. On the back of FITS’ success, astronomical data became both backwards compatible and easily shareable. However, new advances in astronomical instrumentation, computational technologies, and analytic techniques have resulted in new data that do not work well within the traditional FITS format. Tensions have arisen between the desire to update the format to meet new analytic challenges and adherence to the original edict for FITS files to be backwards compatible. We examine three inflection points in the governance of FITS: a) initial development and success, b) widespread acceptance and governance by the working group, and c) the challenges to FITS in a new era of increasing data and computational complexity within astronomy.

The principles of tomorrow's university

(2018)

In the 21st Century, research is increasingly data- and computation-driven. Researchers, funders, and the larger community today emphasize the traits of openness and reproducibility. In March 2017, 13 mostly early-career research leaders who are building their careers around these traits came together with ten university leaders (presidents, vice presidents, and vice provosts), representatives from four funding agencies, and eleven organizers and other stakeholders in an NIH- and NSF-funded one-day, invitation-only workshop titled "Imagining Tomorrow's University." Workshop attendees were charged with launching a new dialog around open research – the current status, opportunities for advancement, and challenges that limit sharing. The workshop examined how the internet-enabled research world has changed, and how universities need to change to adapt commensurately, aiming to understand how universities can and should make themselves competitive and attract the best students, staff, and faculty in this new world. During the workshop, the participants re-imagined scholarship, education, and institutions for an open, networked era, to uncover new opportunities for universities to create value and serve society. They expressed the results of these deliberations as a set of 22 principles of tomorrow's university across six areas: credit and attribution, communities, outreach and engagement, education, preservation and reproducibility, and technologies. Activities that follow on from workshop results take one of three forms. First, since the workshop, a number of workshop authors have further developed and published their white papers to make their reflections and recommendations more concrete. These authors are also conducting efforts to implement these ideas, and to make changes in the university system. Second, we plan to organise a follow-up workshop that focuses on how these principles could be implemented. Third, we believe that the outcomes of this workshop support and are connected with recent theoretical work on the position and future of open knowledge institutions.

PhD Dissertation - From Open Data to Knowledge Production: Biomedical Data Sharing and Unpredictable Data Reuses

(2018)

Using a US consortium for data sharing as the primary field site, this three-year ethnographic research project examines the socio-technical, epistemic, and ethical challenges of making biomedical research data openly available and reusable. Public policy arguments for releasing scientific data for reuse by others include increasing trust in science and leveraging public investments in research. In most types of scientific research, data release occurs in parallel with associated publications, after peer-review. In the consortium studied for this project, datasets may also be released independently without an associated publication. Such research datasets are conceptualized as “hypothesis free” resources from which novel knowledge can be extracted indefinitely. Among the findings of this project are that biomedical researchers do not download and re-analyze “hypothesis free” research data from open repositories as a regular practice. Data reuse is a complex, delicate, and often time-consuming process. Metadata and ontology schemas appear to be necessary but not sufficient for data reuse processes. For scientists to test new hypotheses on “old” data, they depend on access to peer-reviewed primary analyses, pre-existing trusted relationships with the data creators, and shared research agendas. Data donors (patients, study participants, etc.), on the other hand, retain little control over how open research data are reused. Findings suggest that, in practice, it is impossible to predict – and consequently to regulate – how datasets might be reused once made openly available. Unintended consequences of reusing this consortium’s open data already are emerging, to the concern of some participants.

Open data, grey data, and stewardship: Universities at the privacy frontier

(2018)

As universities recognize the inherent value in the data they collect and hold, they encounter unforeseen challenges in stewarding those data in ways that balance accountability, transparency, and protection of privacy, academic freedom, and intellectual property. Two parallel developments in academic data collection are converging: (1) open access requirements, whereby researchers must provide access to their data as a condition of obtaining grant funding or publishing results in journals; and (2) the vast accumulation of “grey data” about individuals in their daily activities of research, teaching, learning, services, and administration. The boundaries between research and grey data are blurring, making it more difficult to assess the risks and responsibilities associated with any data collection. Many sets of data, both research and grey, fall outside privacy regulations such as HIPAA, FERPA, and PII. Universities are exploiting these data for research, learning analytics, faculty evaluation, strategic decisions, and other sensitive matters. Commercial entities are besieging universities with requests for access to data or for partnerships to mine them. The privacy frontier facing research universities spans open access practices, uses and misuses of data, public records requests, cyber risk, and curating data for privacy protection. This Article explores the competing values inherent in data stewardship and makes recommendations for practice by drawing on the pioneering work of the University of California in privacy and information security, data governance, and cyber risk.

Text data mining from the author's perspective: Whose text, whose mining, and to whose benefit?

(2018)

Given the many technical, social, and policy shifts in access to scholarly content since the early days of text data mining, it is time to expand the conversation about text data mining from concerns of the researcher wishing to mine data to include concerns of researcher-authors about how their data are mined, by whom, for what purposes, and to whose benefits.

Open data, grey data, and stewardship: Universities at the privacy frontier

(2018)

As universities recognize the inherent value in the data they collect and hold, they encounter unforeseen challenges in stewarding those data in ways that balance accountability, transparency, and protection of privacy, academic freedom, and intellectual property. Two parallel developments in academic data collection are converging: (1) open access requirements, whereby researchers must provide access to their data as a condition of obtaining grant funding or publishing results in journals; and (2) the vast accumulation of 'grey data' about individuals in their daily activities of research, teaching, learning, services, and administration. The boundaries between research and grey data are blurring, making it more difficult to assess the risks and responsibilities associated with any data collection. Many sets of data, both research and grey, fall outside privacy regulations such as HIPAA, FERPA, and PII. Universities are exploiting these data for research, learning analytics, faculty evaluation, strategic decisions, and other sensitive matters. Commercial entities are besieging universities with requests for access to data or for partnerships to mine them. The privacy frontier facing research universities spans open access practices, uses and misuses of data, public records requests, cyber risk, and curating data for privacy protection. This paper explores the competing values inherent in data stewardship and makes recommendations for practice, drawing on the pioneering work of the University of California in privacy and information security, data governance, and cyber risk.

Digital data archives as knowledge infrastructures: Mediating data sharing and reuse

(2018)

Digital archives are the preferred means for open access to research data. They play essential roles in knowledge infrastructures – robust networks of people, artifacts, and institutions – but little is known about how they mediate information exchange between stakeholders. We open the “black box” of data archives by studying DANS, the Data Archiving and Networked Services institute of The Netherlands, which manages 50+ years of data from the social sciences, humanities, and other domains. Our interviews, weblogs, ethnography, and document analyses reveal that a few large contributors provide a steady flow of content, but most are academic researchers who submit datasets infrequently and often restrict access to their files. Consumers are a diverse group that overlaps minimally with contributors. Archivists devote about half their time to aiding contributors with curation processes and half to assisting consumers. Given the diversity and infrequency of usage, human assistance in curation and search remains essential. DANS’ knowledge infrastructure encompasses public and private stakeholders who contribute, consume, harvest, and serve their data – many of whom did not exist at the time the DANS collections originated – reinforcing the need for continuous investment in digital data archives as their communities, technologies, and services evolve.

Using the Jupyter Notebook as a tool for open science: An empirical study

(2017)

As scientific work becomes more computational and data-intensive, research processes and results become more difficult to interpret and reproduce. In this poster, we show how the Jupyter notebook, a tool originally designed as a free version of Mathematica notebooks, has evolved to become a robust tool for scientists to share code, associated computation, and documentation.

On the Reuse of Scientific Data

(2017)

While science policy promotes data sharing and open data, these are not ends in themselves. Arguments for data sharing are to reproduce research, to make public assets available to the public, to leverage investments in research, and to advance research and innovation. To achieve these expected benefits of data sharing, data must actually be reused by others. Data sharing practices, especially motivations and incentives, have received far more study than has data reuse, perhaps because of the array of contested concepts on which reuse rests and the disparate contexts in which it occurs. Here we explicate concepts of data, sharing, and open data as a means to examine data reuse. We explore distinctions between use and reuse of data. Lastly we propose six research questions on data reuse worthy of pursuit by the community: How can uses of data be distinguished from reuses? When is reproducibility an essential goal? When is data integration an essential goal? What are the tradeoffs between collecting new data and reusing existing data? How do motivations for data collection influence the ability to reuse data? How do standards and formats for data release influence reuse opportunities? We conclude by summarizing the implications of these questions for science policy and for investments in data reuse.