DARC: Decoupled Asymmetric Reasoning Curriculum for LLM Evolution

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability commonly observed in self-play training of large language models, which often stems from reliance on model-generated pseudo-labels and reward signals. To mitigate this issue, the authors propose a two-stage asymmetric self-play framework: first, a questioner module is trained to generate questions of controllable difficulty via a difficulty calibration mechanism; then, a document-augmented teacher model—equipped with external knowledge—produces high-quality pseudo-labels for a student solver that lacks document access, enabling asymmetric self-distillation. By decoupling question generation from problem solving and integrating controllable question difficulty with document-augmented supervision, the approach significantly enhances the stability of self-evolution. Evaluated across nine reasoning benchmarks and three backbone models, it achieves an average improvement of 10.9 points, approaching the performance of fully supervised models without requiring any human annotations.

📝 Abstract
Self-play with large language models has emerged as a promising paradigm for achieving self-improving artificial intelligence. However, existing self-play frameworks often suffer from optimization instability, due to (i) non-stationary objectives induced by solver-dependent reward feedback for the Questioner, and (ii) bootstrapping errors from self-generated pseudo-labels used to supervise the Solver. To mitigate these challenges, we introduce DARC (Decoupled Asymmetric Reasoning Curriculum), a two-stage framework that stabilizes the self-evolution process. First, we train the Questioner to synthesize difficulty-calibrated questions, conditioned on explicit difficulty levels and external corpora. Second, we train the Solver with an asymmetric self-distillation mechanism, where a document-augmented teacher generates high-quality pseudo-labels to supervise the student Solver that lacks document access. Empirical results demonstrate that DARC is model-agnostic, yielding an average improvement of 10.9 points across nine reasoning benchmarks and three backbone models. Moreover, DARC consistently outperforms all baselines and approaches the performance of fully supervised models without relying on human annotations. The code is available at https://github.com/RUCBM/DARC.
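The two-stage procedure the abstract describes can be sketched with toy stand-in models. This is a minimal illustration, not the paper's implementation: all function names (`generate_question`, `teacher_answer`, `student_train_step`, `darc_round`) and the string-based "models" are assumptions made for readability. The essential asymmetry is visible in the data flow: the teacher's call receives the document, the student's training pair does not.

```python
# Hedged sketch of DARC's decoupled two-stage loop with toy stand-ins.
# All names and signatures here are illustrative assumptions, not the
# authors' actual API (see https://github.com/RUCBM/DARC for the real code).

def generate_question(doc: str, difficulty: int) -> str:
    """Stage 1: the Questioner synthesizes a difficulty-calibrated question,
    conditioned on an explicit difficulty level and an external document."""
    return f"[level-{difficulty}] What does the passage claim about: {doc[:40]}...?"

def teacher_answer(question: str, doc: str) -> str:
    """Stage 2, teacher side: document-augmented, so it can ground its
    answer in the source text and emit a higher-quality pseudo-label."""
    return f"Grounded answer to '{question}' derived from: {doc}"

def student_train_step(question: str, pseudo_label: str) -> dict:
    """Stage 2, student side: sees only the question (no document) and is
    supervised on the teacher's pseudo-label -- asymmetric self-distillation."""
    return {"input": question, "target": pseudo_label}

def darc_round(corpus: list[str], difficulties: range) -> list[dict]:
    """One self-evolution round: questions at increasing difficulty levels
    form a curriculum; the teacher labels them; the student consumes the
    (question, pseudo-label) pairs. Question generation and problem solving
    never share a reward signal, which is the decoupling the paper argues
    stabilizes training."""
    batch = []
    for doc in corpus:
        for level in difficulties:
            q = generate_question(doc, level)
            y = teacher_answer(q, doc)              # teacher sees the document
            batch.append(student_train_step(q, y))  # student does not
    return batch

corpus = ["Self-play frameworks can be unstable without decoupled objectives."]
batch = darc_round(corpus, range(1, 4))
print(len(batch))  # 3 training pairs: one per difficulty level
```

In a real run, the three toy functions would each be an LLM call or gradient step; the point of the sketch is only that the Questioner's input (difficulty level + corpus) and the Solver's supervision (teacher pseudo-labels) are produced in separate stages rather than by a shared, non-stationary reward.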
Problem

Research questions and friction points this paper addresses:

- self-play
- optimization instability
- non-stationary objectives
- bootstrapping errors
- pseudo-labels
Innovation

Methods, ideas, and system contributions that make the work stand out:

- self-play
- asymmetric self-distillation
- difficulty-calibrated question generation
- pseudo-label supervision
- model-agnostic self-evolution
Shengda Fan
Gaoling School of Artificial Intelligence, Renmin University of China
Xuyan Ye
Gaoling School of Artificial Intelligence, Renmin University of China
Yankai Lin
Associate Professor (Tenure Track), Gaoling School of AI, Renmin University of China
Natural Language Processing · Large Language Models