OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value

📅 2025-12-15
🤖 AI Summary
The composition, provenance, and evaluation of post-training data for large language models (LLMs) remain opaque, leading to ambiguous performance attribution and irreproducible experiments. Method: We introduce the first open-source, fairness-aware platform for post-training data value assessment, featuring a multidimensional quality scoring framework, an interactive data lineage explorer, and a unified training–inference–evaluation pipeline—advancing data assessment from empirical trial-and-error toward data-centric science. Contribution/Results: The platform supports diverse LLMs (e.g., Llama, Qwen) and 22 cross-domain benchmarks, encompassing 120+ datasets and 600+ training experiments. It uncovers novel empirical regularities: (i) the trade-off between data complexity and model performance, (ii) benchmark redundancy, and (iii) correlations between data lineage structure and downstream performance. All tooling, configurations, and experimental results are fully open-sourced.

📝 Abstract
The rapid evolution of Large Language Models (LLMs) is predicated on the quality and diversity of post-training datasets. However, a critical dichotomy persists: while models are rigorously benchmarked, the data fueling them remains a black box, characterized by opaque composition, uncertain provenance, and a lack of systematic evaluation. This opacity hinders reproducibility and obscures the causal link between data characteristics and model behaviors. To bridge this gap, we introduce OpenDataArena (ODA), a holistic and open platform designed to benchmark the intrinsic value of post-training data. ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models (e.g., Llama, Qwen) and domains; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources; and (iv) a fully open-source toolkit for training, evaluation, and scoring to foster data research. Extensive experiments on ODA, covering over 120 training datasets across multiple domains on 22 benchmarks and validated by more than 600 training runs and 40 million processed data points, reveal non-trivial insights. Our analysis uncovers the inherent trade-offs between data complexity and task performance, identifies redundancy in popular benchmarks through lineage tracing, and maps the genealogical relationships across datasets. We release all results, tools, and configurations to democratize access to high-quality data evaluation. Rather than merely expanding a leaderboard, ODA envisions a shift from trial-and-error data curation to a principled science of Data-Centric AI, paving the way for rigorous studies on data mixing laws and the strategic composition of foundation models.
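The abstract's second pillar, a multi-dimensional scoring framework that profiles data quality along many axes, can be illustrated with a minimal sketch. The axis names, scoring heuristics, and aggregation below are hypothetical placeholders chosen for illustration, not ODA's actual implementation:

```python
# Hypothetical sketch of multi-dimensional data-quality scoring, in the
# spirit of ODA's framework. Axis names and heuristics are illustrative.
from dataclasses import dataclass


@dataclass
class Sample:
    instruction: str
    response: str


def score_length(s: Sample) -> float:
    # Reward longer responses up to a cap; normalized to [0, 1].
    return min(len(s.response.split()) / 256.0, 1.0)


def score_instruction_complexity(s: Sample) -> float:
    # Crude proxy: unique-token ratio of the instruction.
    toks = s.instruction.lower().split()
    return len(set(toks)) / len(toks) if toks else 0.0


# Each axis maps a sample to a score in [0, 1]; a real framework
# would plug in many more axes (difficulty, diversity, safety, ...).
AXES = {
    "length": score_length,
    "instruction_complexity": score_instruction_complexity,
}


def profile(sample: Sample) -> dict[str, float]:
    """Score one sample along every axis."""
    return {name: fn(sample) for name, fn in AXES.items()}


def dataset_profile(samples: list[Sample]) -> dict[str, float]:
    """Mean per-axis score across a dataset."""
    profs = [profile(s) for s in samples]
    return {a: sum(p[a] for p in profs) / len(profs) for a in AXES}
```

Profiling every dataset along the same axes is what makes cross-dataset comparison meaningful: two datasets can then be contrasted axis by axis rather than by a single opaque score.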
Problem

Research questions and friction points this paper is trying to address.

Benchmarking post-training datasets lacks standardized evaluation methods
Current data composition is opaque with unclear provenance and quality
No systematic framework links data characteristics to model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

A unified pipeline for fair model and domain comparisons
A multi-dimensional framework scoring data quality across axes
An open-source toolkit for training, evaluation, and lineage exploration
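The first innovation, a unified pipeline for fair comparisons, amounts to holding the base model, hyperparameters, and evaluation harness fixed while only the training dataset varies. A minimal sketch of that control loop, with `fine_tune` and `evaluate` as hypothetical placeholder callables rather than ODA's real API:

```python
# Hypothetical sketch of a unified training-evaluation loop for comparing
# post-training datasets under identical settings. Function names and the
# fine_tune/evaluate internals are illustrative placeholders.
from typing import Callable


def compare_datasets(
    base_model: str,
    datasets: dict[str, list],  # dataset name -> training samples
    benchmarks: list[str],
    fine_tune: Callable[[str, list], object],
    evaluate: Callable[[object, str], float],
) -> dict[str, dict[str, float]]:
    """Fine-tune the same base model on each dataset, then score each
    resulting model on every benchmark with the same harness."""
    results: dict[str, dict[str, float]] = {}
    for name, data in datasets.items():
        model = fine_tune(base_model, data)  # identical hyperparameters
        results[name] = {b: evaluate(model, b) for b in benchmarks}
    return results
```

Because everything except the dataset is held constant, any score difference between two rows of `results` is attributable to the data itself, which is the fairness property the pipeline is built around.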
👥 Authors
Mengzhang Cai
Shanghai Artificial Intelligence Laboratory, OpenDataLab
Xin Gao
Shanghai Artificial Intelligence Laboratory, OpenDataLab
Yu Li
Shanghai Artificial Intelligence Laboratory, OpenDataLab
Honglin Lin
SJTU
Zheng Liu
Shanghai Artificial Intelligence Laboratory, OpenDataLab
Zhuoshi Pan
Tsinghua University
deep learning, natural language processing
Qizhi Pei
PhD Student, Gaoling School of Artificial Intelligence, Renmin University of China
LLM, Data Synthesis, AI4Science
Xiaoran Shang
Shanghai Artificial Intelligence Laboratory, OpenDataLab
Mengyuan Sun
Shanghai Artificial Intelligence Laboratory, OpenDataLab
Zinan Tang
Undergraduate, Beijing University of Posts and Telecommunications
NLP, LLM, ML, Data, Reasoning
Xiaoyang Wang
Shanghai Artificial Intelligence Laboratory, OpenDataLab
Zhanping Zhong
Shanghai Artificial Intelligence Laboratory, OpenDataLab
Yun Zhu
Shanghai Artificial Intelligence Laboratory, OpenDataLab
Dahua Lin
The Chinese University of Hong Kong
computer vision, machine learning, probabilistic inference, bayesian nonparametrics
Conghui He
Shanghai AI Laboratory
Data-centric AI, LLM, Document Intelligence
Lijun Wu
Shanghai AI Laboratory
ML, LLM, AI4Science