🤖 AI Summary
Existing RL-based fine-tuning of LLM reasoners discards the value function, preventing value-guided computation scaling during inference. Method: We propose RL$^V$, the first framework to reintegrate the value function into the RL training loop without significant overhead, enabling joint modeling of the reasoner and a generative verifier. Our approach unifies value-guided reinforcement learning on synthetically generated data, generative verification modeling, and parallel/sequential collaborative scaling mechanisms. Contribution/Results: RL$^V$ achieves >20% absolute accuracy gain on MATH, 8–32× inference-time computational efficiency improvement, strong generalization across difficulty levels and domains, and 1.2–1.6× performance gains on long-reasoning R1 models. Its core innovation is a unified, value-driven architecture for reasoning and verification, supporting flexible and efficient test-time computation scaling.
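The parallel scaling mechanism described above can be sketched as verifier-guided candidate selection: sample N solutions, score each with the jointly trained verifier, then pick via best-of-N or verifier-weighted voting. This is a minimal illustrative sketch; the function names and the weighting scheme are assumptions, not the paper's exact implementation.

```python
from collections import defaultdict

def best_of_n(answers, verifier_scores):
    """Pick the single candidate the verifier scores highest."""
    return max(zip(answers, verifier_scores), key=lambda t: t[1])[0]

def weighted_vote(answers, verifier_scores):
    """Sum verifier scores over identical final answers; return the winner."""
    totals = defaultdict(float)
    for ans, score in zip(answers, verifier_scores):
        totals[ans] += score
    return max(totals, key=totals.get)

# Toy example: four sampled solutions with hypothetical verifier scores.
answers = ["42", "41", "42", "7"]
scores = [0.8, 0.9, 0.7, 0.1]
print(best_of_n(answers, scores))     # -> "41" (single highest score)
print(weighted_vote(answers, scores)) # -> "42" (0.8 + 0.7 = 1.5 beats 0.9)
```

Weighted voting aggregates verifier confidence across duplicate answers, which is why it can overturn a single high-scoring outlier, as in the toy example.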
📝 Abstract
Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL$^V$, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL$^V$ boosts MATH accuracy by over 20% with parallel sampling and enables $8$–$32\times$ more efficient test-time compute scaling compared to the base RL method. RL$^V$ also exhibits strong generalization for both easy-to-hard and out-of-domain tasks. Furthermore, RL$^V$ achieves $1.2$–$1.6\times$ higher performance when jointly scaling parallel and sequential test-time compute with a long-reasoning R1 model.
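The joint objective the abstract describes can be sketched as a standard "value-free" RL loss plus a verification cross-entropy on the model's own RL-generated solutions. This is a hedged sketch under stated assumptions: a GRPO-style leave-one-out baseline for the RL term, a scalar Yes/No correctness probability for the verifier term, and the mixing weight `lam` are all illustrative choices, not the paper's exact formulation.

```python
import math

def grpo_loss(logprobs, rewards):
    """Group-relative policy gradient with a leave-one-out mean baseline."""
    n = len(rewards)
    baselines = [(sum(rewards) - r) / (n - 1) for r in rewards]
    return -sum(lp * (r - b) for lp, r, b in zip(logprobs, rewards, baselines)) / n

def verify_loss(p_yes, is_correct):
    """Cross-entropy for the generative verifier's Yes/No correctness judgment."""
    eps = 1e-9  # numerical safety for log(0)
    return -sum(
        math.log(p + eps) if y else math.log(1.0 - p + eps)
        for p, y in zip(p_yes, is_correct)
    ) / len(p_yes)

def rlv_loss(logprobs, rewards, p_yes, is_correct, lam=1.0):
    # Joint objective: reasoner RL loss + lam * verifier CE loss,
    # computed for the same model on its own RL-generated solutions.
    return grpo_loss(logprobs, rewards) + lam * verify_loss(p_yes, is_correct)

# Toy group of 2 sampled solutions: one correct (reward 1), one not.
loss = rlv_loss(
    logprobs=[-1.0, -2.0],   # sequence log-probs under the policy
    rewards=[1.0, 0.0],      # outcome rewards from the RL environment
    p_yes=[0.9, 0.1],        # verifier's P("Yes") for each solution
    is_correct=[True, False],
)
```

Setting `lam=0` recovers the base value-free RL method, which matches the abstract's framing of RL$^V$ as an augmentation that adds verification without changing the underlying RL recipe.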