Deep Self-Evolving Reasoning

📅 2025-10-20
🤖 AI Summary
Open-weight, medium-scale language models (e.g., 8B parameters) suffer from weak verification and self-correction capabilities, limiting their performance on complex mathematical reasoning. Method: We propose the Deep Self-Evolving Reasoning (DSER) framework, which models multi-step reasoning as a long-horizon Markov chain. DSER combines parallel multi-process sampling, probabilistic chain-of-thought generation, progressive error correction, and majority voting, requiring only a weak positive improvement bias (i.e., a marginal preference for improvement over degradation) to ensure stable convergence, without relying on strong external verification signals. Contribution/Results: On the AIME 2024–2025 benchmarks, DSER solves 5 of 9 problems previously unsolved by 8B models, and its majority-voted outputs exceed the single-shot accuracy of a 600B-parameter teacher model, revealing a fundamental bottleneck in open-model self-correction and providing a scalable, lightweight remedy.
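The loop the summary describes can be sketched as a minimal toy: several independent long-horizon chains, each repeatedly revising its own candidate, followed by majority voting over the final answers. This is not the paper's implementation; `toy_step`, its transition probabilities, and the target answer 42 are illustrative assumptions standing in for the model's stochastic verify-and-refine step.

```python
import random
from collections import Counter

def self_evolve_chain(step_fn, horizon, rng):
    """One long-horizon chain: repeatedly verify-and-refine the current
    candidate. Each step_fn call is one stochastic transition of the
    Markov chain in solution space."""
    state = None  # start with no candidate solution
    for _ in range(horizon):
        state = step_fn(state, rng)
    return state

def dser(step_fn, n_chains=15, horizon=50, seed=0):
    """Toy DSER sketch: run independent self-evolving chains in parallel,
    then majority-vote over their final answers to amplify a weak
    per-step improvement bias."""
    rng = random.Random(seed)
    finals = [self_evolve_chain(step_fn, horizon, rng) for _ in range(n_chains)]
    answer, _ = Counter(finals).most_common(1)[0]
    return answer, finals

# Hypothetical stand-in for verify-and-refine: a wrong candidate is fixed
# to the correct answer 42 with probability 0.45, while a correct one is
# broken with probability 0.05. Improvement only needs to outweigh
# degradation for the chains to drift toward 42.
def toy_step(state, rng):
    if state == 42:
        return 42 if rng.random() > 0.05 else rng.randrange(41)
    return 42 if rng.random() < 0.45 else rng.randrange(41)

answer, finals = dser(toy_step)
print(answer)
```

Even though no single step is reliable, most chains end at the correct answer, so the vote recovers it; scattered wrong answers rarely agree with each other.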

📝 Abstract
Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners. By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
Problem

Research questions and friction points this paper is trying to address.

Extending the reasoning limits of small models whose verification capabilities are weak
Guaranteeing convergence to correct solutions when improvement only marginally outweighs degradation
Diagnosing the fundamental limitations of open-weight reasoners in self-verification and refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

A probabilistic paradigm (DSER) that extends reasoning limits despite weak verification and refinement
Multiple parallel long-horizon self-evolving processes that amplify small positive improvement tendencies
A Markov-chain formulation in which convergence is guaranteed when the probability of improvement marginally exceeds that of degradation
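The Markov-chain claim above can be checked numerically with a toy two-state chain per reasoning process ("wrong" vs. "correct"). The transition probabilities 0.12 and 0.08 are illustrative assumptions, chosen only to show that a merely marginal bias already pushes the long-run correct fraction above 1/2, which majority voting then amplifies.

```python
import random

def stationary_fraction(p_improve, p_degrade, n_chains=2000, horizon=300, seed=1):
    """Simulate many independent two-state chains: 'wrong' -> 'correct'
    with probability p_improve, 'correct' -> 'wrong' with p_degrade.
    Returns the fraction of chains ending in 'correct'; the chain's
    stationary distribution predicts p_improve / (p_improve + p_degrade)."""
    rng = random.Random(seed)
    correct_at_end = 0
    for _ in range(n_chains):
        correct = False  # every chain starts from a wrong candidate
        for _ in range(horizon):
            if correct:
                correct = rng.random() >= p_degrade  # survives unless degraded
            else:
                correct = rng.random() < p_improve   # fixed with small prob.
        correct_at_end += correct
    return correct_at_end / n_chains

# Marginal bias: improvement 0.12 vs. degradation 0.08
# -> predicted long-run correct fraction 0.12 / 0.20 = 0.6 > 1/2.
frac = stationary_fraction(0.12, 0.08)
```

Because each chain exceeds 1/2 probability of being correct at the horizon, the probability that a majority vote over independent chains is correct grows toward 1 as more chains are added.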
Zihan Liu
Peking University
Shun Zheng
Microsoft Research Asia
Xumeng Wen
Microsoft Research Asia
Yang Wang
Microsoft Research Asia
Jiang Bian
Microsoft Research Asia
Mao Yang
Microsoft Research Asia