This presentation will provide an overview of our work to date on the 2021 LAUC Research Grant project, “A review of publishing and sharing practices for machine learning objects for informing library curation practices.” We will cover the motivation behind this project, our research methods, and preliminary findings from our work.
Machine learning (ML) is a field of study that combines computer science and statistical techniques to achieve goals by “learning” iteratively through experience, and it is becoming more common across a range of disciplines. Creating an ML research object is resource-intensive, often requiring large amounts of training and test data and processing power. In addition, ML reproducibility depends on rigorous documentation, but often falls short because components (e.g., training data, source code, algorithms), properties (parameters, methods, workflows, provenance), and computing environments (software packages and versions) are incomplete or poorly described. Broad sharing of ML outputs, if properly documented and organized with an eye towards reusability, can therefore make future research more efficient and reproducible. To date, formalized guidelines and recommended practices for documenting and sharing ML objects are scarce, at least within library-centric professions and generalist data repositories. We seek to inform our practice by learning about current norms and standards for ML objects, if any, and to share the knowledge gained with the broader data curation community.
Our project includes a broad survey of ML objects on a selection of repositories that specialize in ML research workflows and outputs, as well as several generalist repositories. In conducting this broad scan of repositories for ML research objects and analyzing the provided metadata, we aim to identify “good” sharing practices as well as to assess whether the observed frequencies of these practices vary significantly among generalist repositories and across disciplines.