PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

Date: 2026-05-18

AI Summary
This work addresses the challenge of credit assignment in multi-turn complex tasks, where conventional reinforcement learning methods struggle due to reliance on sparse outcome-based rewards, and existing dense reward schemes often incur high costs or require ground-truth labels. The authors propose a prefix-aware reward mechanism leveraging internal states of large language models, combining hidden-state probing with attention features to construct step-level dense rewards without external models or reference answers. They reveal, for the first time, the impact of prefix contamination on internal probing and introduce a two-stage reward model that achieves robustness against contaminated trajectories while maintaining highly accurate reward estimation, all with negligible inference overhead. Experiments demonstrate that the method attains state-of-the-art AUROC on contaminated trajectories, significantly outperforming baselines, and enables efficient multi-turn agent optimization with minimal computational cost.
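To make the credit-assignment problem concrete, the following is a minimal Python sketch of group-relative advantage normalization in the style of GRPO, the method named in the abstract. It is an illustration under stated assumptions, not the authors' training code: the function names, the toy rewards, and the use of the population standard deviation are all hypothetical choices. The point it shows is that with a single outcome reward per trajectory, every intermediate step inherits the same advantage.

```python
from statistics import mean, pstdev


def group_relative_advantages(outcome_rewards, eps=1e-8):
    """Normalize each trajectory's outcome reward against its sampling group.

    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]


def broadcast_to_steps(advantages, steps_per_trajectory):
    """With outcome-only rewards, each step simply copies its trajectory's advantage."""
    return [[a] * n for a, n in zip(advantages, steps_per_trajectory)]


# Toy group of four sampled trajectories (hypothetical values):
# 1.0 = task solved, 0.0 = task failed.
outcomes = [1.0, 0.0, 0.0, 1.0]
step_counts = [3, 2, 4, 3]

adv = group_relative_advantages(outcomes)
per_step = broadcast_to_steps(adv, step_counts)

for i, steps in enumerate(per_step):
    print(f"trajectory {i}: {[round(a, 3) for a in steps]}")
```

In this toy group, a failed four-step trajectory assigns the same negative advantage to all four steps, even if only one of them caused the failure. A step-level reward signal is meant to replace that uniform broadcast with per-step values.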
๐Ÿ“ Abstract
A significant hurdle for current LLMs is the execution of complex, multi-stage tasks. Group Relative Policy Optimization (GRPO) has emerged as a leading choice, but its reliance on sparse outcome rewards severely limits credit assignment across intermediate steps. Existing remedies, such as running full rollouts to assign step-level advantages, calling external LLM judges at each step, or computing intrinsic rewards that require ground-truth answers at every evaluation, introduce significant costs or practical constraints. We hypothesize that internal correctness probing over LLM hidden states can be repurposed as a step-level reward signal, potentially addressing all of these limitations at once. However, existing probing research assumes clean inputs, and we first show that this assumption breaks down in multi-step settings: hidden-state probes degrade severely under prefix contamination, tracking coherence with the (possibly corrupted) prefix rather than grounded correctness, while attention-based features remain robust to contamination but underperform on clean prefixes. Building on this complementary relationship, we propose the Prefix-Aware Internal Reward (PAIR), a two-stage model with a frozen hidden-state probe that estimates belief-consistency and a lightweight attention-based head that corrects it toward grounded correctness. Experimental results show that PAIR achieves the highest AUROC on contaminated trajectories while operating at negligible inference cost, enabling dense step-level reward signals for GRPO training without external model calls, ground-truth dependencies, or full-trajectory rollouts.
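The two-stage idea described above can be sketched in a few lines of Python. The abstract does not give PAIR's equations, so everything below is a hypothetical reading: the additive logit correction, the linear probe, the single attention feature, the toy data, and all class and function names are assumptions made for illustration only. The sketch keeps a stage-1 probe frozen and trains only a small stage-2 head on attention features.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


class FrozenProbe:
    """Stage 1 (assumed linear): scores a hidden-state vector; never updated."""

    def __init__(self, weights, bias):
        self._w = tuple(weights)  # tuple: weights cannot be modified in place
        self._b = bias

    def logit(self, hidden_state):
        return dot(self._w, hidden_state) + self._b


class AttentionHead:
    """Stage 2 (assumed linear): maps attention features to a logit correction."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def correction(self, attn_features):
        return dot(self.w, attn_features) + self.b


def step_reward(probe, head, hidden_state, attn_features):
    """Assumed combination: probe logit plus head correction, squashed to (0, 1)."""
    return sigmoid(probe.logit(hidden_state) + head.correction(attn_features))


def train_head(probe, head, data, lr=0.5, epochs=200):
    """Logistic-loss SGD that updates the head only; the probe stays frozen."""
    for _ in range(epochs):
        for hidden_state, attn_features, label in data:
            err = step_reward(probe, head, hidden_state, attn_features) - label
            head.w = [w - lr * err * x for w, x in zip(head.w, attn_features)]
            head.b -= lr * err


# Hypothetical toy steps: (hidden state, attention feature, step is correct?).
clean_correct = ([1.0, 0.0], [0.1], 1.0)
clean_wrong = ([-1.0, 0.0], [0.1], 0.0)
# Coherent with a corrupted prefix, so the probe is fooled, but actually wrong.
contaminated_wrong = ([1.0, 0.0], [0.9], 0.0)

probe = FrozenProbe(weights=[2.0, 0.0], bias=0.0)
head = AttentionHead(n_features=1)

before = [step_reward(probe, head, h, a) for h, a, _ in (clean_correct, contaminated_wrong)]
train_head(probe, head, [clean_correct, clean_wrong, contaminated_wrong])
after = [step_reward(probe, head, h, a) for h, a, _ in (clean_correct, contaminated_wrong)]

print("probe only :", [round(r, 3) for r in before])
print("probe+head :", [round(r, 3) for r in after])
```

In this toy setup the clean-correct and contaminated-wrong steps share the same hidden state, so the probe alone cannot tell them apart. Only the attention feature separates them, which mirrors the complementary relationship the abstract describes.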
Problem

Research questions and friction points this paper is trying to address.

multi-turn agent optimization
credit assignment
sparse rewards
internal reward modeling
prefix contamination
Innovation

Methods, ideas, or system contributions that make the work stand out.

internal reward
prefix contamination
multi-turn optimization
GRPO
hidden-state probing