Synthetic Data RL: Task Definition Is All You Need

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) fine-tuning for large language models typically relies on labor-intensive, high-quality human-annotated data, a major bottleneck for scalability and cost.

Method: The paper proposes a fully automated, task-definition-driven synthetic data RL framework. Given only a formal task description, it generates question-answer pairs from retrieved documents, adapts question difficulty to the model's current capability via a model-capability-aware generation mechanism, and selects training samples for PPO fine-tuning based on the model's average pass rate.

Contribution/Results: The framework eliminates manual annotation entirely, unifying retrieval augmentation, difficulty control, and sample selection into a closed-loop automated pipeline. Experiments show substantial absolute gains across diverse benchmarks: +29.2% on GSM8K, +8.7% on MATH, +13.1% on GPQA, +8.9% on MedQA, and +17.7%/+13.7% on CQA (law) and CFA (finance), nearly matching RL fine-tuning on full human-annotated data. Adding 100 human demonstrations improves GSM8K by only +0.4 pp, confirming a drastic reduction in annotation dependency.

📝 Abstract
Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts the difficulty of the question based on model solvability, and selects questions using the average pass rate of the model across samples for RL training. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves performance on GSM8K by only 0.4 pp, showing limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
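The difficulty-adaptation step described in the abstract could be sketched roughly as below. Here `generator` and `model` are hypothetical stand-ins for LLM calls (a question-rewriting prompt and an answer-sampling call); the band thresholds and round count are illustrative assumptions, not values from the paper:

```python
def adapt_difficulty(generator, model, question, answer, k=8,
                     low=0.2, high=0.8, max_rounds=3):
    """Regenerate a question until the model's pass rate lands in a target band.

    generator(question, direction) -- hypothetical LLM call that rewrites the
    question to be "harder" or "easier"; model(question) -- one sampled answer.
    """
    for _ in range(max_rounds):
        # Empirical solvability: fraction of k rollouts matching the reference.
        rate = sum(model(question) == answer for _ in range(k)) / k
        if rate > high:
            question = generator(question, "harder")   # too easy: harden it
        elif rate < low:
            question = generator(question, "easier")   # too hard: soften it
        else:
            break                                      # in the useful band
    return question
```

In this sketch the loop simply stops after `max_rounds` even if the question never reaches the band; a real pipeline would likely discard such questions instead.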
Problem

Research questions and friction points this paper is trying to address.

Reducing reliance on human-labeled data for RL model adaptation
Generating synthetic training data from task definitions automatically
Improving model performance across diverse domains with synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates synthetic question-answer pairs from a task definition and retrieved documents
Adapts question difficulty to the model's solvability
Selects RL training samples by the model's average pass rate across sampled rollouts
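The pass-rate-based selection above can be sketched as follows; this is a minimal illustration, where `model` is a hypothetical answer-sampling call and the band thresholds are assumptions, not the paper's actual values:

```python
def pass_rate(model, question, answer, k=8):
    # Fraction of k sampled rollouts that produce the reference answer.
    return sum(model(question) == answer for _ in range(k)) / k

def select_for_rl(model, qa_pairs, k=8, low=0.2, high=0.8):
    # Keep questions the model sometimes, but not always, solves:
    # these carry the strongest learning signal for RL fine-tuning.
    return [(q, a) for q, a in qa_pairs
            if low <= pass_rate(model, q, a, k) <= high]
```

Questions falling outside the band (always solved, or never solved) would presumably be routed back to the generator for harder or easier variants, closing the loop.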
Yiduo Guo
Peking University
Zhen Guo
MIT
Chuanwei Huang
Peking University
Zi-Ang Wang
Peking University
Zekai Zhang
Peking University
Haofei Yu
UIUC
Huishuai Zhang
Peking University
Deep Learning · Optimization · Information Theory
Yikang Shen
xAI
Deep Learning · Natural Language Processing