RLVR Datasets and Where to Find Them: Tracing Data Lineage for Better Training Data

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses data contamination and redundancy in Reinforcement Learning from Verifiable Rewards (RLVR) datasets, problems that arise because the datasets draw on heterogeneous sources with unclear provenance. The authors propose ATLAS (Atomic-source Tracing via Lineage-Aware Search), a framework that traces 1.45 million samples back to 20 atomic sources and attributes over 99.7% of them. Using Source-level Counterfactual Attribution (SCA), which compares per-atomic-source RL checkpoints against a shared base model, they curate DAPO++, a decontaminated dataset with concentrated learning signals. They also define a composite dataset quality score Q built on these attribution signals. Experiments on Qwen3 series models show that Q correlates strongly with downstream RLVR performance and that DAPO++ consistently improves results on held-out benchmarks. The analysis finds that most existing RLVR datasets are variants of a small set of shared upstream sources, which underscores the need for carefully curated data in RLVR research.

📝 Abstract

The proliferation of Reinforcement Learning from Verifiable Rewards (RLVR) datasets has exacerbated provenance collapse due to unclear lineage among existing datasets. To bridge this fragmented RLVR data landscape, we propose Atomic-source Tracing via Lineage-Aware Search (ATLAS), a systematic framework for tracing RLVR datasets back to their atomic sources, attributing over 99.7% of 1.45M instances to 20 atomic sources. Our analysis reveals that most RLVR datasets are variants of a small set of shared upstream sources, with few introducing genuinely new data, and many facing data contamination risks. These findings naturally motivate us to curate a new RLVR dataset, DAPO++, and to benchmark existing datasets from a lineage-aware perspective. To this end, we propose Source-level Counterfactual Attribution (SCA) as a guiding principle to curate a decontaminated training dataset with concentrated learning signals. Essentially, SCA measures a sample's marginal utility by comparing per-atomic-source RL checkpoints against a shared base model. Building upon these attribution signals, we further design a composite dataset quality score Q that strongly correlates with downstream RLVR performance. Experiments on Qwen3 series models verify that DAPO++ consistently improves performance on held-out benchmarks, while Q reliably predicts downstream RLVR training effectiveness. Our code and data are available at https://github.com/Celine-hxy/ATLAS.
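
To make the SCA idea concrete, here is a minimal toy sketch of "marginal utility relative to a shared base model". It is not the authors' implementation: the abstract does not give the formula, so the pass-rate difference used here, the `Sample` fields, the `marginal_utility` and `select` helpers, the `min_gain` threshold, and the example numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    sample_id: str
    source: str       # atomic source the sample was traced to (hypothetical label)
    base_pass: float  # pass rate of the shared base model on this sample
    ckpt_pass: float  # pass rate of the RL checkpoint trained only on `source`


def marginal_utility(s: Sample) -> float:
    # Assumed proxy: how much the per-source checkpoint gains over the base model.
    return s.ckpt_pass - s.base_pass


def select(samples, min_gain=0.0):
    # Keep samples whose assumed utility exceeds the (hypothetical) threshold.
    return [s for s in samples if marginal_utility(s) > min_gain]


# Illustrative numbers only, not results from the paper.
samples = [
    Sample("a", "src_1", base_pass=0.25, ckpt_pass=0.75),
    Sample("b", "src_1", base_pass=0.50, ckpt_pass=0.50),
    Sample("c", "src_2", base_pass=0.75, ckpt_pass=0.50),
]

kept = select(samples)
print([s.sample_id for s in kept])
```

In this toy setup only sample `a` is kept, because it is the only one where the per-source checkpoint improves on the base model. The actual SCA criterion and the definition of Q are in the paper and the linked repository.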

Problem

Research questions and friction points this paper is trying to address.

RLVR

data lineage

provenance collapse

data contamination

dataset curation

Innovation

Methods, ideas, or system contributions that make the work stand out.

RLVR

data lineage

ATLAS