Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

📅 2025-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing RL-based fine-tuning of LLM reasoners discards the value function, preventing value-guided computation scaling during inference. Method: We propose RL$^V$, the first framework to reintegrate the value function into the RL training loop without significant overhead, enabling joint modeling of the reasoner and a generative verifier. Our approach unifies value-guided reinforcement learning on synthetically generated data, generative verification modeling, and parallel/sequential collaborative scaling mechanisms. Contribution/Results: RL$^V$ achieves >20% absolute accuracy gain on MATH, 8–32× inference-time computational efficiency improvement, strong generalization across difficulty levels and domains, and 1.2–1.6× performance gains on long-reasoning R1 models. Its core innovation is a unified, value-driven architecture for reasoning and verification, supporting flexible and efficient test-time computation scaling.

📝 Abstract
Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL$^V$, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20% with parallel sampling and enables $8{-}32\times$ more efficient test-time compute scaling compared to the base RL method. RL$^V$ also exhibits strong generalization for both easy-to-hard and out-of-domain tasks. Furthermore, RL$^V$ achieves $1.2{-}1.6\times$ higher performance when jointly scaling parallel and sequential test-time compute with a long-reasoning R1 model.
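The parallel-sampling scheme the abstract describes can be sketched as verifier-weighted Best-of-N selection: the same jointly trained model samples N solutions and, acting as a generative verifier, scores each one; answers are then chosen by summed verifier mass rather than plain majority vote. A minimal sketch, where `generate` and `verify_prob` are hypothetical stubs standing in for the actual RL$^V$ model calls:

```python
from collections import defaultdict

def generate(question, seed):
    # Stub: pretend the model samples a (solution, final_answer) pair.
    answers = ["42", "41", "42", "42", "40", "42", "41", "42"]
    return f"reasoning trace {seed}", answers[seed % len(answers)]

def verify_prob(question, solution, answer):
    # Stub: pretend the verifier emits P(correct | question, solution).
    return 0.9 if answer == "42" else 0.2

def weighted_best_of_n(question, n=8):
    """Weighted majority vote: sum verifier scores per final answer,
    then return the answer with the highest total verifier mass."""
    scores = defaultdict(float)
    for seed in range(n):
        solution, answer = generate(question, seed)
        scores[answer] += verify_prob(question, solution, answer)
    return max(scores, key=scores.get)

print(weighted_best_of_n("What is 6 * 7?"))  # → 42
```

Because verification reuses the reasoner's own forward passes, stronger verifier scores let a smaller N match the accuracy of a much larger unweighted sample pool, which is the source of the reported $8{-}32\times$ compute-efficiency gain.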
Problem

Research questions and friction points this paper is trying to address.

Value-free RL methods (e.g., GRPO, Leave-one-out PPO) discard the learned value function, blocking value-guided verification at inference
Test-time compute scaling is inefficient without a verifier to score sampled solutions
Accuracy and compute efficiency on mathematical reasoning tasks suffer as a result
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies the reasoner and a generative verifier in a single LLM via RL$^V$
Jointly trains on RL-generated data, adding verification without significant overhead
Enables efficient parallel and sequential test-time compute scaling
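The joint training idea in the bullets above can be written as a combined objective. A hedged sketch, where $\lambda$ is a hypothetical weighting coefficient not specified in this summary: the standard value-free RL objective is augmented with a generative-verification term, a cross-entropy loss on the model's own "Yes/No" correctness judgment of RL-generated solutions $y$ for problems $x$ with correctness labels $c$:

$$
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{RL}}(\theta) \;+\; \lambda \,\mathbb{E}_{(x,\,y,\,c)}\!\left[-\log p_\theta(c \mid x, y)\right]
$$

Since the second term is ordinary next-token prediction on verification prompts, it adds verification capability with little overhead beyond the RL training the model already performs.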