🤖 AI Summary
Large language models (LLMs) face a critical bottleneck in improving their reasoning: overreliance on scarce, costly externally annotated data and limited capacity for autonomous self-improvement.
Method: We propose DTE, an unsupervised self-evolution framework driven by multi-agent debate. DTE eliminates dependence on ground-truth labels and introduces a self-evolution paradigm grounded in debate trajectories. It employs a Reflect-Critique-Refine prompting strategy that explicitly instructs agents to reflect critically and iteratively refine their reasoning. The framework integrates multi-agent debate, self-reflection prompting, debate-trajectory distillation training, and zero-shot transfer.
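The debate-then-distill loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the agents are toy stubs, and the function names (`debate`, `build_training_example`) and the Reflect-Critique-Refine prompt wording are assumptions for illustration only.

```python
from collections import Counter

# Illustrative Reflect-Critique-Refine instruction (hypothetical wording).
RCR_PROMPT = (
    "Reflect on your previous answer, critique the other agents' "
    "reasoning, then refine your solution."
)

def debate(agents, question, rounds=2):
    """Run a multi-agent debate and return (consensus answer, full trace)."""
    # Round 0: each agent answers independently.
    answers = [agent(question, context=None) for agent in agents]
    trace = [list(answers)]
    for _ in range(rounds):
        # Each agent sees its peers' answers plus the Reflect-Critique-Refine
        # instruction, and may revise its own answer.
        answers = [
            agent(question, context=(RCR_PROMPT,
                                     [a for j, a in enumerate(answers) if j != i]))
            for i, agent in enumerate(agents)
        ]
        trace.append(list(answers))
    # Consensus by majority vote -- no ground-truth label is consulted.
    consensus, _ = Counter(answers).most_common(1)[0]
    return consensus, trace

def build_training_example(question, consensus, trace):
    """Distillation target: the question paired with the debate-derived
    consensus answer and the trajectory that produced it."""
    return {"prompt": question, "target": consensus, "trajectory": trace}

# Toy agents: two answer correctly from the start; one starts wrong
# but adopts the peer majority after seeing the debate context.
def confident(question, context=None):
    return "8"

def corrigible(question, context=None):
    if context is None:
        return "6"
    _, peer_answers = context
    return Counter(peer_answers).most_common(1)[0][0]

agents = [confident, confident, corrigible]
answer, trace = debate(agents, "What is 3 + 5?")
example = build_training_example("What is 3 + 5?", answer, trace)
print(answer)  # → 8
```

In the actual framework, the collected examples would then fine-tune a single model on its own debate trajectories, so the evolved model can answer without running a debate at inference time.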
Contribution/Results: On the challenging GSM-PLUS dataset, DTE achieves an average accuracy gain of 8.92%; across the other reasoning benchmarks, it yields an average improvement of 5.8%, demonstrating strong cross-domain generalization. DTE establishes a new paradigm for unsupervised reasoning optimization in LLMs, advancing autonomous capability growth without human supervision.
📝 Abstract
Large language models (LLMs) have significantly improved their reasoning through extensive training on massive datasets. However, relying solely on additional data for improvement is becoming increasingly impractical, highlighting the need for models to autonomously enhance their reasoning without external supervision. In this paper, we propose Debate, Train, Evolve (DTE), a novel ground-truth-free training framework that uses multi-agent debate traces to evolve a single language model. We also introduce a new prompting strategy, Reflect-Critique-Refine, which improves debate quality by explicitly instructing agents to critique and refine their reasoning. Extensive evaluations on five reasoning benchmarks with six open-weight models show that our DTE framework achieves substantial improvements, with an average accuracy gain of 8.92% on the challenging GSM-PLUS dataset. Furthermore, we observe strong cross-domain generalization, with an average accuracy gain of 5.8% on all other benchmarks, suggesting that our method captures general reasoning capabilities.