The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track

📅 2024-10-29
🏛️ arXiv.org
🤖 AI Summary
Machine learning (ML) suffers from weak data curation practices and insufficient documentation of ethical, environmental, and data management information. Method: We systematically evaluated 60 datasets from the NeurIPS Datasets and Benchmarks Track (2021–2023), introducing bibliometric data cataloging theory from library and information science to ML for the first time. We developed a literature-driven, four-dimensional evaluation framework—assessing documentation completeness, ethical impact, environmental footprint, and data management—and designed an actionable, structured rubric alongside an open-source assessment toolkit. Contribution/Results: Our analysis revealed widespread deficiencies across all four dimensions. We also released the first exemplar metadata repository showcasing best practices and, based on these findings, formulated actionable guidelines for conference reviewers and community adoption. All artifacts—including the framework, rubric, toolkit, and metadata—are openly shared to advance ML datasets toward higher quality, reusability, and standardization.

📝 Abstract
Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia. The field of machine learning is increasingly recognizing the importance of data curation to the advancement of both applications and fundamental understanding of machine learning models, evidenced not least by the creation of the Datasets and Benchmarks track itself. This work analyzes dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a literature review of data curation principles. We use the framework to assess the strengths and weaknesses of current dataset development practices across 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021 to 2023, and we summarize key findings and trends. Results indicate a greater need for documentation of environmental footprint, ethical considerations, and data management. We suggest targeted strategies and resources to improve documentation in these areas and provide recommendations for the NeurIPS peer-review process that prioritize rigorous data curation in ML. Finally, we provide our results in the format of a dataset that itself showcases recommended data curation practices. Our rubric and results are of interest both for improving data curation practices broadly in the field of ML and to scholars of data curation and science and technology studies who examine ML practices. Our aim is to support continued improvement in interdisciplinary research on dataset practices, ultimately improving the reusability and reproducibility of new datasets and benchmarks, enabling standardized and informed human oversight, and strengthening the foundation of rigorous and responsible ML research.
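To make the idea of a four-dimensional documentation rubric concrete, the sketch below scores a dataset's documentation as the fraction of criteria met per dimension. This is a hypothetical illustration only: the dimension names follow the abstract, but the individual criteria, weighting, and function names are assumptions, not the authors' actual rubric or toolkit.

```python
# Hypothetical sketch of a four-dimension dataset-documentation rubric.
# Dimension names follow the paper's abstract; the criteria listed under
# each dimension are illustrative assumptions, not the published rubric.
DIMENSIONS = {
    "documentation_completeness": [
        "motivation stated", "composition described", "collection process documented",
    ],
    "ethical_considerations": ["consent addressed", "sensitive attributes flagged"],
    "environmental_footprint": ["compute cost reported", "energy/carbon estimated"],
    "data_management": ["license specified", "maintenance plan", "persistent identifier"],
}

def score_dataset(checklist: dict) -> dict:
    """Return, per dimension, the fraction (0.0-1.0) of criteria satisfied."""
    scores = {}
    for dim, criteria in DIMENSIONS.items():
        met = checklist.get(dim, set())
        scores[dim] = len(met & set(criteria)) / len(criteria)
    return scores

# Example: a dataset whose documentation states motivation and a license only.
example = {
    "documentation_completeness": {"motivation stated"},
    "data_management": {"license specified"},
}
print(score_dataset(example))
```

A real assessment toolkit would attach evidence and reviewer notes to each criterion rather than a bare set-membership check; the point here is only the shape of a per-dimension completeness score.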
Problem

Research questions and friction points this paper is trying to address.

Data Curation
Machine Learning
Ethical Considerations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data Curation Framework
NeurIPS Dataset Quality Assessment
Machine Learning Data Practices
👥 Authors

Eshta Bhardwaj
Faculty of Information, University of Toronto; Digital Curation Institute, University of Toronto

Harshit Gujral
Department of Computer Science, University of Toronto

Siyi Wu
University of Toronto
Climate Informatics, Human-Computer Interaction, Human-AI Collaboration

Ciara Zogheib
Faculty of Information, University of Toronto

Tegan Maharaj
Faculty of Information, University of Toronto

Christoph Becker
Professor of Information, University of Toronto
responsible computing, sustainable computing, human-computer interaction, post-growth, degrowth