🤖 AI Summary
Large language models struggle to self-improve without external feedback or additional training, primarily due to difficulties in generating high-quality candidate solutions and selecting correct answers in an unsupervised setting. This work proposes the Test-time Recursive Thinking (TRT) framework, which enables iterative self-optimization during inference through strategy-guided reasoning, knowledge accumulation, and self-generated validation signals. TRT achieves effective self-improvement for the first time without relying on reinforcement learning or human annotations, establishing an end-to-end test-time optimization pipeline. Experimental results demonstrate that open-source models attain 100% accuracy on AIME-25/24, while closed-source models show performance gains of 10.4–14.8 percentage points on the most challenging problems in LiveCodeBench.
📝 Abstract
Modern Large Language Models (LLMs) have shown rapid improvements in reasoning capabilities, driven largely by reinforcement learning (RL) with verifiable rewards. Here, we ask whether these LLMs can self-improve without additional training. We identify two core challenges for such systems: (i) efficiently generating diverse, high-quality candidate solutions, and (ii) reliably selecting correct answers in the absence of ground-truth supervision. To address these challenges, we propose Test-time Recursive Thinking (TRT), an iterative self-improvement framework that conditions generation on rollout-specific strategies, accumulated knowledge, and self-generated verification signals. Using TRT, open-source models reach 100% accuracy on AIME-25/24, and on LiveCodeBench's most difficult problems, closed-source models improve by 10.4–14.8 percentage points without external feedback.
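The abstract's loop — generate diverse candidates under rollout-specific strategies and accumulated knowledge, score them with self-generated verification signals, and fold the best result back into the knowledge pool — can be sketched as follows. This is a minimal toy simulation, not the paper's implementation: the function names (`generate`, `verify`, `trt`), the strategy list, and the scoring logic are all illustrative assumptions, with stub functions standing in for LLM rollouts.

```python
import random

# Toy sketch of a TRT-style loop: each round generates candidates under
# several strategies plus accumulated knowledge, scores them with a
# self-generated (noisy, ground-truth-free) verification signal, and
# accumulates a knowledge note for the next round. All names are illustrative.

STRATEGIES = ["direct", "decompose", "work-backwards"]

def generate(problem, strategy, knowledge, rng):
    """Stand-in for an LLM rollout: returns a (candidate, quality) pair.
    Quality improves as knowledge accumulates, mimicking self-improvement."""
    base = rng.random()
    bonus = 0.1 * len(knowledge)  # accumulated knowledge helps later rounds
    return f"{strategy}-solution", min(1.0, base + bonus)

def verify(candidate, quality, rng):
    """Stand-in for self-verification: a noisy score, no ground truth used."""
    return quality + 0.05 * rng.random()

def trt(problem, rounds=4, seed=0):
    rng = random.Random(seed)
    knowledge, best, best_score = [], None, -1.0
    for _ in range(rounds):
        for strategy in STRATEGIES:          # diverse candidate rollouts
            cand, q = generate(problem, strategy, knowledge, rng)
            score = verify(cand, q, rng)     # self-generated signal only
            if score > best_score:
                best, best_score = cand, score
        knowledge.append(f"note-from-round-{len(knowledge)}")  # accumulate
    return best, best_score

answer, score = trt("toy problem")
print(answer, round(score, 3))
```

With a fixed seed the loop is deterministic, and running more rounds can only match or improve the best self-verified score, which is the qualitative behavior the framework relies on.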