MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning

📅 2025-10-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Problem: Test-time scaling often incurs prohibitive computational costs, making it difficult to balance performance and efficiency. Method: This paper proposes a recursive reasoning-verification-summarization framework that iteratively leverages the model's own capabilities to refine and correct multi-step reasoning. Contribution/Results: The method sharply reduces computational overhead while maintaining or even improving accuracy, scoring 99.79 on AIME2025 with only 4% of DeepConf's FLOPs. It substantially narrows the gap between Pass@k and Pass@1, improving reasoning consistency and robustness, and its generality is validated across multiple open-source language models and multimodal benchmarks. By enabling efficient, self-correcting inference, it establishes a new paradigm for scalable, reliable test-time adaptation.

📝 Abstract
Test-time scaling has emerged as a promising paradigm in language modeling, wherein additional computational resources are allocated during inference to enhance model performance. Recent approaches, such as DeepConf, have demonstrated the efficacy of this strategy, however, they often incur substantial computational overhead to achieve competitive results. In this work, we propose MatryoshkaThinking, a novel method that significantly reduces computational cost while maintaining state-of-the-art performance. Specifically, MatryoshkaThinking attains a score of 99.79 on AIME2025 using only 4% of the computation required by DeepConf. The core of our approach lies in the recursive exploitation of the model's intrinsic capabilities in reasoning, verification, and summarization, which collectively enhance the retention of correct solutions and reduce the disparity between Pass@k and Pass@1. Comprehensive evaluations across multiple open-source models and challenging multi-modal reasoning benchmarks validate the effectiveness and generality of our method. These findings offer new insights into the design of efficient and scalable test-time inference strategies for advanced language models.
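The abstract describes the method only at a high level: candidate solutions are produced by recursive reasoning, filtered by self-verification, and condensed by summarization. A minimal sketch of such a loop, with a toy stand-in for the model call (every name, the recursion scheme, and the majority-vote summarizer are illustrative assumptions, not the paper's actual algorithm):

```python
import random

def matryoshka_solve(model, question, depth=2, samples=3):
    """Recursively reason, verify, and summarize candidate answers."""
    if depth == 0:
        # Base case: one direct reasoning pass.
        return model("reason", question)

    # 1. Reasoning: each candidate is itself a shallower recursive solve
    #    (the nested "matryoshka" structure).
    candidates = [matryoshka_solve(model, question, depth - 1, samples)
                  for _ in range(samples)]

    # 2. Verification: keep only candidates the model judges correct.
    verified = [c for c in candidates if model("verify", question, c)]

    # 3. Summarization: condense the surviving candidates into one answer.
    pool = verified or candidates
    return model("summarize", question, pool)


def toy_model(mode, question, payload=None):
    """Toy LLM stand-in: reasons correctly ~60% of the time, verifies
    against the known answer, and summarizes by majority vote."""
    if mode == "reason":
        return 42 if random.random() < 0.6 else 7  # sometimes wrong
    if mode == "verify":
        return payload == 42  # oracle verifier, for the demo only
    return max(set(payload), key=payload.count)  # majority vote


random.seed(0)
print(matryoshka_solve(toy_model, "What is 6 * 7?"))  # prints 42
```

Even with an unreliable base reasoner, the verify-then-summarize recursion retains correct solutions at each level, which is the intuition behind narrowing the Pass@k vs Pass@1 gap.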
Problem

Research questions and friction points this paper is trying to address.

Test-time scaling approaches such as DeepConf incur substantial computational overhead to reach competitive results
A persistent disparity between Pass@k and Pass@1 indicates that correct solutions are found but not reliably retained
How can inference-time compute be allocated efficiently without sacrificing state-of-the-art accuracy?
Innovation

Methods, ideas, or system contributions that make the work stand out.

MatryoshkaThinking, a recursive test-time scaling method matching state-of-the-art accuracy at roughly 4% of DeepConf's compute
Recursively exploits the model's intrinsic reasoning, verification, and summarization capabilities
Improves retention of correct solutions, narrowing the Pass@k vs Pass@1 gap
Hongwei Chen
ERNIE Team, Baidu
Yishu Lei
ERNIE Team, Baidu
Dan Zhang
ERNIE Team, Baidu
Bo Ke
ERNIE Team, Baidu
Danxiang Zhu
ERNIE Team, Baidu
Xuyi Chen
ERNIE Team, Baidu
Yuxiang Lu
ERNIE Team, Baidu
Zhengjie Huang
Baidu Inc
Shikun Feng
Baidu
Jingzhou He
ERNIE Team, Baidu
Yu Sun
ERNIE Team, Baidu
Hua Wu
ERNIE Team, Baidu
Haifeng Wang
ERNIE Team, Baidu