Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability

📅 2025-06-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Contemporary dataset papers frequently suffer from limited originality, insufficient diversity, inadequate quality control, and poor transparency about construction methodology; existing datasheets are largely descriptive and lack quantifiable evaluation criteria or enforceable accountability mechanisms. Method: the authors propose DataRubrics, a rubric-based framework for structured data-quality assessment that combines LLM-as-a-judge evaluation with synthetic data techniques to enable automated, reproducible, and standardized quality scoring for both human- and model-generated datasets. Contribution/Results: the framework is accompanied by an open-source evaluation toolkit (github.com/datarubrics/datarubrics) that lets reviewers and authors conduct collaborative, measurable data review, supporting rigor, transparency, and trustworthiness in data-centric research through objective, interpretable, and auditable quality metrics.

📝 Abstract
High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized metrics for dataset quality evaluation
Insufficient transparency in dataset construction details
Need for scalable methods of synthetic data generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-based evaluation metrics for datasets
Synthetic data generation using LLM-as-a-judge
DataRubrics framework for scalable quality assessment
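The rubric-based, LLM-as-a-judge idea above can be sketched as follows. This is a minimal illustration, not the actual DataRubrics API: the criteria names, the prompt format, and the `stub_judge` function are all hypothetical stand-ins (a real setup would call an LLM instead of the stub).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricCriterion:
    name: str
    description: str
    max_score: int

# Hypothetical criteria loosely mirroring the paper's themes;
# not the rubric actually defined by DataRubrics.
CRITERIA = [
    RubricCriterion("originality", "Does the dataset offer novel content or tasks?", 5),
    RubricCriterion("diversity", "Does it cover varied domains, languages, or styles?", 5),
    RubricCriterion("quality_control", "Are annotation and validation steps documented?", 5),
    RubricCriterion("transparency", "Are construction details fully reported?", 5),
]

def build_prompt(criterion: RubricCriterion, dataset_card: str) -> str:
    """Format a single-criterion judging prompt for an LLM."""
    return (
        f"Score the dataset below on '{criterion.name}' "
        f"({criterion.description}) from 0 to {criterion.max_score}. "
        f"Reply with the integer only.\n\nDataset card:\n{dataset_card}"
    )

def score_dataset(dataset_card: str, judge: Callable[[str], str]) -> dict:
    """Run the judge once per criterion and collect clamped integer scores."""
    scores = {}
    for c in CRITERIA:
        raw = judge(build_prompt(c, dataset_card))
        scores[c.name] = min(max(int(raw.strip()), 0), c.max_score)
    return scores

# Deterministic stub standing in for a real LLM call.
def stub_judge(prompt: str) -> str:
    return "4"

card = "A multilingual QA dataset with documented annotation guidelines."
print(score_dataset(card, stub_judge))
# → {'originality': 4, 'diversity': 4, 'quality_control': 4, 'transparency': 4}
```

Scoring one criterion per call keeps each judgment focused and makes the per-criterion scores auditable, which matches the paper's emphasis on interpretable, reproducible evaluation.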
👥 Authors

G. Winata (Capital One)
David Anugraha (Stanford University): Machine Learning, Natural Language Processing, Multimodality, Artificial Intelligence
Emmy Liu (PhD Student, Carnegie Mellon University)
Alham Fikri Aji (MBZUAI, Monash Indonesia): Multilinguality, Low-resource NLP, Language Modeling, Machine Translation
Shou-Yi Hung (University of Toronto)
Aditya Parashar (University of Massachusetts Amherst): Artificial Intelligence
Patrick Amadeus Irawan (MBZUAI, SMU): Natural Language Processing, Vision Language, Multimodality, Interpretability
Ruochen Zhang (Brown University): Multilingual NLP, Interpretability, Code-Switching
Zheng-Xin Yong (Brown University): Machine Learning
Jan Christian Blaise Cruz (MBZUAI, McGill University, Mila - Quebec AI Institute): Natural Language Processing, Translation, Multilinguality, Low-resource Languages, Code Switching
Niklas Muennighoff (Stanford University): Large Language Models, Artificial Intelligence, Machine Learning
Seungone Kim (Carnegie Mellon University): Large Language Models, Natural Language Processing
Hanyang Zhao (Columbia University)
Sudipta Kar (Principal Applied Scientist at Oracle Health AI): Artificial Intelligence, Natural Language Processing, Machine Learning, Deep Learning
K. E. Suryoraharjo (University of Toronto)
M. F. Adilazuarda (MBZUAI)
En-Shiun Annie Lee (Ontario Tech University, and University of Toronto (Status-Only)): Natural Language Processing, Data Mining, Pattern Analysis
Ayu Purwarianti (Associate Professor, Informatics, Institut Teknologi Bandung, Indonesia): Computational Linguistics, Machine Learning
D. Wijaya (Monash University)
Monojit Choudhury (Professor of Natural Language Processing, MBZUAI): Natural Language Processing, Large Language Models, Ethics of AI, Computational Social Science