Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency and high gradient variance of training large reasoning models with reinforcement learning with verifiable rewards (RLVR), where existing methods rely on computationally expensive baseline estimators or on multiple rollouts per prompt. The authors propose POISE, a method that reuses internal states already computed during the policy's forward pass (hidden-layer representations and token-entropy statistics) to train lightweight value probes that predict the return baseline online. Because it estimates a prompt's value from a single rollout, POISE needs neither an auxiliary critic model nor repeated sampling, and a cross-rollout construction keeps the gradient unbiased. For a fixed compute budget, this allows more diverse prompts per update. Experiments on Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B show that POISE matches DAPO on math reasoning benchmarks at lower computational cost, that its value estimator performs similarly to a separate LLM-scale value model, and that the estimator generalizes to various verifiable tasks.
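
The unbiasedness claim behind the cross-rollout construction can be made concrete with the standard baseline argument for policy gradients. The notation below is assumed for illustration and does not come from the paper: $x$ is a prompt, $y$ and $y'$ are two independent rollouts from $\pi_\theta(\cdot \mid x)$, and $b(x, y')$ is a baseline computed only from the other rollout's internal states.

```latex
\begin{aligned}
\mathbb{E}_{y,\,y'}\!\left[\nabla_\theta \log \pi_\theta(y \mid x)\, b(x, y')\right]
  &= \mathbb{E}_{y'}\!\left[b(x, y')\right]\,
     \mathbb{E}_{y}\!\left[\nabla_\theta \log \pi_\theta(y \mid x)\right]
     && \text{(independence of } y \text{ and } y'\text{)} \\
  &= \mathbb{E}_{y'}\!\left[b(x, y')\right]\,
     \nabla_\theta \sum_{y} \pi_\theta(y \mid x) \\
  &= \mathbb{E}_{y'}\!\left[b(x, y')\right]\, \nabla_\theta 1 \;=\; 0 .
\end{aligned}
```

Subtracting such a baseline therefore leaves the expected gradient unchanged. The factorization in the first step would fail for a baseline $b(x, y)$ built from the same rollout it is applied to, which is why trajectory-conditioned features need an independent rollout.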

📝 Abstract

Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model-scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the sampling overhead of detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
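
As a rough illustration of the cross-rollout idea described in the abstract, the following minimal sketch uses a logistic probe over a small feature vector. Everything in it is an assumption made for illustration: the function names, the logistic form of the probe, the pairing of exactly two rollouts per prompt, and the use of plain lists in place of real hidden states and entropy statistics. It is not the authors' implementation.

```python
import math


def probe_predict(weights, bias, features):
    """Hypothetical lightweight probe: logistic regression over internal-state
    features, returning a predicted reward in (0, 1)."""
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def cross_rollout_advantages(rollout_a, rollout_b, weights, bias):
    """Each rollout is a (features, reward) pair for the same prompt.

    The baseline for rollout A is predicted from rollout B's features and
    vice versa, so no baseline depends on the trajectory it is applied to.
    """
    feats_a, reward_a = rollout_a
    feats_b, reward_b = rollout_b
    adv_a = reward_a - probe_predict(weights, bias, feats_b)
    adv_b = reward_b - probe_predict(weights, bias, feats_a)
    return adv_a, adv_b


def probe_sgd_step(weights, bias, features, reward, lr=0.1):
    """One online SGD step on binary cross-entropy between the probe's
    prediction and the observed verifiable reward."""
    error = probe_predict(weights, bias, features) - reward  # dLoss/dlogit
    new_weights = [w - lr * error * f for w, f in zip(weights, features)]
    new_bias = bias - lr * error
    return new_weights, new_bias


# Toy usage: two rollouts of one prompt, with made-up 2-d features.
weights, bias = [1.0, -1.0], 0.0
rollout_a = ([0.0, 0.0], 1.0)  # rollout A was verified correct
rollout_b = ([2.0, 0.0], 0.0)  # rollout B was verified incorrect
adv_a, adv_b = cross_rollout_advantages(rollout_a, rollout_b, weights, bias)
# Train the probe the same way it is used: B's features predict A's reward.
weights, bias = probe_sgd_step(weights, bias, rollout_b[0], rollout_a[1])
```

In this sketch the advantage for rollout A is unchanged if A's own features change, which is the property the cross-rollout construction relies on. How the paper actually pairs rollouts and which features feed the probe are not specified here and may differ.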

Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning

Value Estimation

Large Reasoning Models

Baseline Estimation

Gradient Variance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Internal State Value Estimation

Reinforcement Learning with Verifiable Rewards

Cross-Rollout Baseline