$\textbf{Re}^{2}$: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving

📅 2026-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models often degrade during reasoning: a low-quality initial chain of thought can trap them in redundant pathways, and they rarely self-correct effectively. To address this, this work proposes Re², which, for the first time, uses a purely reinforcement learning–based framework—without any supervised fine-tuning—to dynamically control the reasoning process via a verifiable reward mechanism. This teaches the model to proactively abandon unproductive reasoning paths and restart problem-solving. Re² raises the retry-behavior rate from 0.5% to over 30%, surpassing the limitations of conventional RLVR (Reinforcement Learning with Verifiable Rewards). Under identical training compute budgets, Re² substantially outperforms standard RLVR and shows consistent performance gains as the number of test-time samples increases, demonstrating the effectiveness and scalability of its dynamic retry mechanism.
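The mechanism described above can be sketched minimally: the reward is outcome-only (verifiable), so restarts are never rewarded directly; the policy is reinforced for re-solving only when a restart ultimately leads to a correct final answer. The `<redo>` marker, the `####` answer delimiter, and both function names below are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of a verifiable reward plus a retry-rate metric.
# Assumptions (not from the paper): the model marks a restart with a
# literal "<redo>" token and emits its final answer after "####".

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Outcome-only reward: 1.0 iff the final answer matches the gold answer."""
    final = completion.rsplit("####", 1)[-1].strip()
    return 1.0 if final == gold_answer else 0.0

def retry_rate(completions: list[str], marker: str = "<redo>") -> float:
    """Fraction of sampled rollouts containing at least one restart marker."""
    if not completions:
        return 0.0
    return sum(marker in c for c in completions) / len(completions)

rollouts = [
    "try path A ... dead end <redo> restart ... #### 42",
    "straight path, no restart ... #### 41",
]
rewards = [verifiable_reward(c, "42") for c in rollouts]
# rewards -> [1.0, 0.0]; retry_rate(rollouts) -> 0.5
```

Under this shaping, a vanilla policy that almost never emits the restart marker (the paper reports ~0.5%) can be driven toward frequent re-solving (over 30%) purely because restarts that rescue a bad initial chain of thought earn the outcome reward.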

📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning performance of large language models (LLMs) by increasing test-time compute. However, even after extensive RLVR training, such models still tend to generate unnecessary and low-quality steps in their chain-of-thought (CoT), leading to inefficient overthinking and lower answer quality. We show that when the initial direction or quality of the CoT is suboptimal, the model often fails to reach the correct answer, even after generating several times more tokens than when the initial CoT is well-initialized. To this end, we introduce Reinforcement Learning with Re-solving (Re$^2$), in which LLMs learn to flexibly abandon unproductive reasoning paths and restart the solution process when necessary, rather than always committing to a final answer. Re$^2$ applies pure reinforcement learning without any preliminary supervised fine-tuning, successfully amplifying the rare redo behavior in vanilla models from only 0.5% to over 30%. This leads to substantial performance gains over standard RLVR under the same training compute budget, and also demonstrates notable improvements in test-time performance as the number of samples increases.
Problem

Research questions and friction points this paper is trying to address.

large language models
chain-of-thought
reasoning efficiency
overthinking
answer quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning with Re-solving
Chain-of-Thought Reasoning
Large Language Models
Test-time Redo Behavior
Verifiable Rewards