🤖 AI Summary
Reinforcement learning (RL) fine-tuning for large language models typically relies heavily on labor-intensive, high-quality human-annotated data—a major bottleneck in scalability and cost.
Method: This paper proposes a fully automated synthetic-data RL framework driven solely by a task definition: it generates question-answer pairs from the task description and retrieved documents, adapts question difficulty to the model's current capability, and selects training samples for PPO fine-tuning based on the model's average pass rate.
Contribution/Results: The framework eliminates manual annotation entirely, unifying retrieval augmentation, difficulty control, and sample selection into a closed-loop automated pipeline. Experiments show substantial absolute gains across diverse benchmarks: +29.2 pp on GSM8K, +8.7 pp on MATH, +13.1 pp on GPQA, +8.9 pp on MedQA, and +17.7 pp / +13.7 pp on the legal (CQA) and financial (CFA) datasets, nearly matching RL fine-tuning with full human-annotated data. Adding 100 human demonstrations improves GSM8K by only +0.4 pp, confirming a drastic reduction in annotation dependence.
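The sample-selection step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes exact-match grading of sampled answers, and the `low`/`high` pass-rate thresholds and all function names are hypothetical. The idea is to keep only questions the model sometimes but not always solves, since trivially easy or impossibly hard questions carry little RL training signal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A synthetic question with k sampled model answers (stubbed here)."""
    question: str
    gold: str
    samples: List[str]


def pass_rate(c: Candidate) -> float:
    # Fraction of sampled answers matching the gold answer.
    # Exact string match is a simplification of real answer grading.
    return sum(a.strip() == c.gold.strip() for a in c.samples) / len(c.samples)


def select_for_rl(cands: List[Candidate],
                  low: float = 0.1, high: float = 0.9) -> List[Candidate]:
    # Keep mid-difficulty questions: solved sometimes, but not always.
    return [c for c in cands if low <= pass_rate(c) <= high]
```

A question answered correctly in 3 of 4 samples (pass rate 0.75) would be kept, while one solved every time (1.0) or never (0.0) would be dropped.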
📝 Abstract
Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts question difficulty based on model solvability, and selects questions for RL training using the model's average pass rate across samples. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves GSM8K performance by only 0.4 pp, showing limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.