Toward Enhanced Reusability: A Comparative Analysis of Metadata for Machine Learning Objects and Their Characteristics in Generalist and Specialist Repositories
Skip to main content
eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Previously Published Works bannerUC San Diego

Toward Enhanced Reusability: A Comparative Analysis of Metadata for Machine Learning Objects and Their Characteristics in Generalist and Specialist Repositories

Published Web Location

https://doi.org/10.7191/jeslib.685
No data is associated with this publication.
Creative Commons 'BY' version 4.0 license
Abstract

Objective: The rapidly increasing prevalence and application of machine learning (ML) across disciplines creates a pressing need to establish guidance for data curation professionals. However, we must first understand the characteristics of ML-related objects shared in generalist and specialist repositories and the extent to which repository metadata fields enable findability and reuse of ML objects. Methods: We used a combination of API queries and web scraping to retrieve metadata for ML objects in eight commonly used generalist and ML-specific data repositories. We assessed both metadata schema and characteristics of deposited ML objects, within the context of the widely adopted FAIR Principles. We also calculated summary statistics for properties of objects, including number of objects per year, dataset size, domains represented, and availability of related resources. Results: Generalist repositories excelled at providing provenance metadata, specifically unique identifiers, unambiguous citations, clear licenses, and related resources, while specialist repositories emphasized ML-specific descriptive metadata, such as number of attributes and instances and task type. In terms of object content, we noted a wide range of file formats, as well as licenses, all of which impact reusability. Conclusions: Generalist repositories will benefit from some of the practices adopted by specialists, and specialist repositories will benefit from adopting proven data curation practices of generalist repositories. A step forward for repositories will be to invest more into use of labels and persistent identifiers to improve workflow documentation, provenance, and related resource linking of ML objects, which will increase their findability, interoperability, and reusability.

Many UC-authored scholarly publications are freely available on this site because of the UC's open access policies. Let us know how this access is important for you.

Item not freely available? Link broken?
Report a problem accessing this item