🤖 AI Summary
Open-weight medium-scale language models (e.g., 8B-parameter) suffer from weak verification and self-correction capabilities, limiting their performance on complex mathematical reasoning.
Method: We propose the Deep Self-Evolving Reasoning (DSER) framework, which models multi-step reasoning as a long-horizon Markov chain. DSER employs parallel multi-process sampling, probabilistic chain-of-thought generation, progressive error correction, and majority voting, requiring only a weak positive improvement bias (i.e., a marginal preference for improvement over degradation) to ensure stable convergence, without relying on strong external verification signals.
Contribution/Results: On the AIME 2024-2025 benchmark, DSER solves 5 of 9 problems previously unsolved by 8B models. Its majority-voted outputs achieve higher accuracy than single-shot inference from its 600B-parameter teacher model, revealing the fundamental self-correction bottleneck in open-weight models and providing a scalable, lightweight solution.
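The convergence condition behind the summary can be made explicit under a minimal two-state abstraction (an illustrative simplification of the Markov-chain framing, not the paper's full formalism):

```latex
% s_t \in \{0, 1\}: the candidate solution at step t is incorrect (0) or correct (1).
% Let p = \Pr(0 \to 1) be the per-step improvement probability and
%     q = \Pr(1 \to 0) the per-step degradation probability.
\[
  P \;=\; \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix},
  \qquad
  \pi_{\mathrm{correct}} \;=\; \frac{p}{p+q}.
\]
% For p, q \in (0,1) the chain is ergodic, so it converges to the stationary
% distribution \pi from any initial solution, and \pi_correct > 1/2 exactly
% when p > q: a weak positive improvement bias suffices. Majority voting over
% n independent chains then recovers the correct answer with probability
% approaching 1 as n grows (Condorcet-style amplification of the margin).
```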
Abstract
Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners. By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
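As a sanity check on the amplification argument, the abstract's setup can be sketched as a toy Monte Carlo simulation (my own illustration; the transition probabilities, step count, and function names are assumptions, not values from the paper). Each chain is a two-state Markov process that is only marginally more likely to fix a wrong solution than to break a correct one; individually each chain is barely better than a coin flip, yet a majority vote over many independent chains is almost always correct.

```python
import random
from collections import Counter


def self_evolve(p_improve=0.35, p_degrade=0.25, steps=400, seed=None):
    """Simulate one self-evolving reasoning chain as a two-state Markov
    process: the current solution is incorrect (0) or correct (1).

    Per refinement step, a wrong solution is fixed with probability
    p_improve, and a correct one is broken (weak verification) with
    probability p_degrade. Returns the final state after `steps` steps.
    """
    rng = random.Random(seed)
    state = 0  # start from an incorrect solution
    for _ in range(steps):
        r = rng.random()
        if state == 0 and r < p_improve:
            state = 1
        elif state == 1 and r < p_degrade:
            state = 0
    return state


def dser_vote(n_chains=101, **kwargs):
    """Run many independent chains and majority-vote their final answers.

    An odd n_chains avoids ties. The stationary probability that any one
    chain ends correct is p_improve / (p_improve + p_degrade) ~= 0.583
    for the defaults above, and voting amplifies that small margin.
    """
    finals = [self_evolve(seed=i, **kwargs) for i in range(n_chains)]
    return Counter(finals).most_common(1)[0][0]


if __name__ == "__main__":
    frac = sum(self_evolve(seed=i) for i in range(256)) / 256
    print(f"fraction of chains ending correct: {frac:.2f}")
    print(f"majority vote over 101 chains: {dser_vote()}")
```

Note the design choice mirroring the abstract: no chain is ever externally verified; the only thing the vote relies on is that improvement is slightly more likely than degradation, which is exactly the "weak positive tendency" DSER claims to amplify.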