DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability commonly observed in self-play training of large language models, which often stems from reliance on model-generated pseudo-labels and reward signals. To mitigate this issue, the authors propose a two-stage asymmetric self-play framework: first, a questioner module is trained to generate questions of controllable difficulty via a difficulty calibration mechanism; then, a document-augmented teacher model—equipped with external knowledge—produces high-quality pseudo-labels for a student solver that lacks document access, enabling asymmetric self-distillation. By decoupling question generation from problem solving and integrating controllable question difficulty with document-augmented supervision, the approach significantly enhances the stability of self-evolution. Evaluated across nine reasoning benchmarks and three backbone models, it achieves an average improvement of 10.9 points, approaching the performance of fully supervised models without requiring any human annotations.

📝 Abstract
Self-play with large language models has emerged as a promising paradigm for achieving self-improving artificial intelligence. However, existing self-play frameworks often suffer from optimization instability, due to (i) non-stationary objectives induced by solver-dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self-generated pseudo-labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two-stage framework that stabilizes the self-evolution process. First, we train the Questioner to synthesize difficulty-calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self-distillation mechanism, where a document-augmented teacher generates high-quality pseudo-labels to supervise the student Solver that lacks document access. Empirical results demonstrate that DARC is model-agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human annotations. The code is available at https://github.com/RUCBM/DARC.
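The two-stage procedure the abstract describes can be sketched with toy stand-in models. This is a minimal illustration, not the paper's implementation: all function names (`generate_question`, `teacher_answer`, `student_train_step`, `darc_round`) and the string-based "models" are assumptions made for readability. The essential asymmetry is visible in the data flow: the teacher's call receives the document, the student's training pair does not.

```python
# Hedged sketch of DARC's decoupled two-stage loop with toy stand-ins.
# All names and signatures here are illustrative assumptions, not the
# authors' actual API (see https://github.com/RUCBM/DARC for the real code).

def generate_question(doc: str, difficulty: int) -> str:
    """Stage 1: the Questioner synthesizes a difficulty-calibrated question,
    conditioned on an explicit difficulty level and an external document."""
    return f"[level-{difficulty}] What does the passage claim about: {doc[:40]}...?"

def teacher_answer(question: str, doc: str) -> str:
    """Stage 2, teacher side: document-augmented, so it can ground its
    answer in the source text and emit a higher-quality pseudo-label."""
    return f"Grounded answer to '{question}' derived from: {doc}"

def student_train_step(question: str, pseudo_label: str) -> dict:
    """Stage 2, student side: sees only the question (no document) and is
    supervised on the teacher's pseudo-label -- asymmetric self-distillation."""
    return {"input": question, "target": pseudo_label}

def darc_round(corpus: list[str], difficulties: range) -> list[dict]:
    """One self-evolution round: questions at increasing difficulty levels
    form a curriculum; the teacher labels them; the student consumes the
    (question, pseudo-label) pairs. Question generation and problem solving
    never share a reward signal, which is the decoupling the paper argues
    stabilizes training."""
    batch = []
    for doc in corpus:
        for level in difficulties:
            q = generate_question(doc, level)
            y = teacher_answer(q, doc)              # teacher sees the document
            batch.append(student_train_step(q, y))  # student does not
    return batch

corpus = ["Self-play frameworks can be unstable without decoupled objectives."]
batch = darc_round(corpus, range(1, 4))
print(len(batch))  # 3 training pairs: one per difficulty level
```

In a real run, the three toy functions would each be an LLM call or gradient step; the point of the sketch is only that the Questioner's input (difficulty level + corpus) and the Solver's supervision (teacher pseudo-labels) are produced in separate stages rather than by a shared, non-stationary reward.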
Problem

Research questions and friction points this paper addresses:

- self-play
- optimization instability
- non-stationary objectives
- bootstrapping errors
- pseudo-labels
Innovation

Methods, ideas, and system contributions that make the work stand out:

- self-play
- asymmetric self-distillation
- difficulty-calibrated question generation
- pseudo-label supervision
- model-agnostic self-evolution
Shengda Fan
Gaoling School of Artificial Intelligence, Renmin University of China
Xuyan Ye
Gaoling School of Artificial Intelligence, Renmin University of China
Yankai Lin
Associate Professor (Tenure Track), Gaoling School of AI, Renmin University of China
Natural Language Processing · Large Language Models