🤖 AI Summary
Open-weight medium-scale language models (e.g., 8B-parameter) suffer from weak verification and self-correction capabilities, limiting their performance on complex mathematical reasoning.
Method: We propose the Deep Self-Evolving Reasoning (DSER) framework, which models multi-step reasoning as a long-horizon Markov chain. DSER employs parallel multi-process sampling, probabilistic chain-of-thought generation, progressive error correction, and majority voting, requiring only a weak positive improvement bias (i.e., a marginal preference for improvement over degradation) to ensure stable convergence, without relying on strong external verification signals.
Contribution/Results: On the AIME 2024-2025 benchmark, DSER solves 5 of 9 problems previously unsolved by 8B models. Its majority-voted outputs achieve higher accuracy than single-shot inference from its 600B-parameter teacher model, revealing the fundamental self-correction bottleneck in open-weight models and providing a scalable, lightweight solution.
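The convergence condition behind the summary can be made explicit under a minimal two-state abstraction (an illustrative simplification of the Markov-chain framing, not the paper's full formalism):

```latex
% s_t \in \{0, 1\}: the candidate solution at step t is incorrect (0) or correct (1).
% Let p = \Pr(0 \to 1) be the per-step improvement probability and
%     q = \Pr(1 \to 0) the per-step degradation probability.
\[
  P \;=\; \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix},
  \qquad
  \pi_{\mathrm{correct}} \;=\; \frac{p}{p+q}.
\]
% For p, q \in (0,1) the chain is ergodic, so it converges to the stationary
% distribution \pi from any initial solution, and \pi_correct > 1/2 exactly
% when p > q: a weak positive improvement bias suffices. Majority voting over
% n independent chains then recovers the correct answer with probability
% approaching 1 as n grows (Condorcet-style amplification of the margin).
```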
Abstract
Long-form chain-of-thought reasoning has become a cornerstone of advanced reasoning in large language models. While recent verification-refinement frameworks have enabled proprietary models to solve Olympiad-level problems, their effectiveness hinges on strong, reliable verification and correction capabilities, which remain fragile in open-weight, smaller-scale models. This work demonstrates that even with weak verification and refinement capabilities on hard tasks, the reasoning limits of such models can be substantially extended through a probabilistic paradigm we call Deep Self-Evolving Reasoning (DSER). We conceptualize iterative reasoning as a Markov chain, where each step represents a stochastic transition in the solution space. The key insight is that convergence to a correct solution is guaranteed as long as the probability of improvement marginally exceeds that of degradation. By running multiple long-horizon, self-evolving processes in parallel, DSER amplifies these small positive tendencies, enabling the model to asymptotically approach correct answers. Empirically, we apply DSER to the DeepSeek-R1-0528-Qwen3-8B model. On the challenging AIME 2024-2025 benchmark, DSER solves 5 out of 9 previously unsolvable problems and boosts overall performance, enabling this compact model to surpass the single-turn accuracy of its 600B-parameter teacher through majority voting. Beyond its immediate utility for test-time scaling, the DSER framework serves to diagnose the fundamental limitations of current open-weight reasoners. By clearly delineating their shortcomings in self-verification, refinement, and stability, our findings establish a clear research agenda for developing next-generation models with powerful, intrinsic self-evolving capabilities.
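As a sanity check on the amplification argument, the abstract's setup can be sketched as a toy Monte Carlo simulation (my own illustration; the transition probabilities, step count, and function names are assumptions, not values from the paper). Each chain is a two-state Markov process that is only marginally more likely to fix a wrong solution than to break a correct one; individually each chain is barely better than a coin flip, yet a majority vote over many independent chains is almost always correct.

```python
import random
from collections import Counter


def self_evolve(p_improve=0.35, p_degrade=0.25, steps=400, seed=None):
    """Simulate one self-evolving reasoning chain as a two-state Markov
    process: the current solution is incorrect (0) or correct (1).

    Per refinement step, a wrong solution is fixed with probability
    p_improve, and a correct one is broken (weak verification) with
    probability p_degrade. Returns the final state after `steps` steps.
    """
    rng = random.Random(seed)
    state = 0  # start from an incorrect solution
    for _ in range(steps):
        r = rng.random()
        if state == 0 and r < p_improve:
            state = 1
        elif state == 1 and r < p_degrade:
            state = 0
    return state


def dser_vote(n_chains=101, **kwargs):
    """Run many independent chains and majority-vote their final answers.

    An odd n_chains avoids ties. The stationary probability that any one
    chain ends correct is p_improve / (p_improve + p_degrade) ~= 0.583
    for the defaults above, and voting amplifies that small margin.
    """
    finals = [self_evolve(seed=i, **kwargs) for i in range(n_chains)]
    return Counter(finals).most_common(1)[0][0]


if __name__ == "__main__":
    frac = sum(self_evolve(seed=i) for i in range(256)) / 256
    print(f"fraction of chains ending correct: {frac:.2f}")
    print(f"majority vote over 101 chains: {dser_vote()}")
```

Note the design choice mirroring the abstract: no chain is ever externally verified; the only thing the vote relies on is that improvement is slightly more likely than degradation, which is exactly the "weak positive tendency" DSER claims to amplify.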