🤖 AI Summary
Large language models (LLMs) face a critical bottleneck in improving their reasoning: overreliance on scarce, costly externally annotated data and limited capacity for autonomous self-improvement.
Method: We propose DTE, an unsupervised self-evolution framework driven by multi-agent debate. DTE eliminates dependence on ground-truth labels and introduces a self-evolution paradigm grounded in debate trajectories. It employs a Reflect-Critique-Refine prompting strategy that explicitly instructs agents to reflect critically and iteratively refine their reasoning. The framework integrates multi-agent debate, self-reflection prompting, debate-trajectory distillation training, and zero-shot transfer.
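The debate-then-distill loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the agents are toy stubs, and the function names (`debate`, `build_training_example`) and the Reflect-Critique-Refine prompt wording are assumptions for illustration only.

```python
from collections import Counter

# Illustrative Reflect-Critique-Refine instruction (hypothetical wording).
RCR_PROMPT = (
    "Reflect on your previous answer, critique the other agents' "
    "reasoning, then refine your solution."
)

def debate(agents, question, rounds=2):
    """Run a multi-agent debate and return (consensus answer, full trace)."""
    # Round 0: each agent answers independently.
    answers = [agent(question, context=None) for agent in agents]
    trace = [list(answers)]
    for _ in range(rounds):
        # Each agent sees its peers' answers plus the Reflect-Critique-Refine
        # instruction, and may revise its own answer.
        answers = [
            agent(question, context=(RCR_PROMPT,
                                     [a for j, a in enumerate(answers) if j != i]))
            for i, agent in enumerate(agents)
        ]
        trace.append(list(answers))
    # Consensus by majority vote -- no ground-truth label is consulted.
    consensus, _ = Counter(answers).most_common(1)[0]
    return consensus, trace

def build_training_example(question, consensus, trace):
    """Distillation target: the question paired with the debate-derived
    consensus answer and the trajectory that produced it."""
    return {"prompt": question, "target": consensus, "trajectory": trace}

# Toy agents: two answer correctly from the start; one starts wrong
# but adopts the peer majority after seeing the debate context.
def confident(question, context=None):
    return "8"

def corrigible(question, context=None):
    if context is None:
        return "6"
    _, peer_answers = context
    return Counter(peer_answers).most_common(1)[0][0]

agents = [confident, confident, corrigible]
answer, trace = debate(agents, "What is 3 + 5?")
example = build_training_example("What is 3 + 5?", answer, trace)
print(answer)  # → 8
```

In the actual framework, the collected examples would then fine-tune a single model on its own debate trajectories, so the evolved model can answer without running a debate at inference time.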
Contribution/Results: On the challenging GSM-PLUS dataset, DTE achieves an average accuracy gain of 8.92%; across the other reasoning benchmarks, it yields an average improvement of 5.8%, demonstrating strong cross-domain generalization. DTE establishes a new paradigm for unsupervised reasoning optimization in LLMs, advancing autonomous capability growth without human supervision.
📝 Abstract
Large language models (LLMs) have significantly improved their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground-truth-free training framework that uses multi-agent debate traces to evolve a single language model. We also introduce a new prompting strategy, Reflect-Critique-Refine, which improves debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on five reasoning benchmarks with six open-weight models show that our DTE framework achieves substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe strong cross-domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities.