Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing large language model (LLM)-based policy optimization methods see only scalar rewards, so they cannot use fine-grained behavioral details from trajectories to make precise improvements. The authors propose Reflective Prompted Policy Optimization (R2PO), a two-stage framework that treats trajectories as in-context evidence: a Search-LLM proposes candidate policy parameters, the environment executes them, and a Critic-LLM proposes targeted revisions grounded in the observed states, actions, and rewards. R2PO separates global search from behavior-grounded revision and mitigates salience bias in the Critic-LLM (fixating on a single failed rollout even when most rollouts succeed) through aggregate rollout statistics, median-trajectory selection, and a revision rule. Using a 20B open-weight model, it achieves the highest mean best reward across all ten environments, reaches near-maximum reward on CartPole within roughly 500 episodes, and trains far more stably than both deep reinforcement learning baselines and prior LLM-based methods.
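The salience-bias mitigation can be illustrated with a minimal sketch, assuming each rollout is a dict with a scalar `return` and a list of `steps`. The function names, the dict layout, and the choice of the lower median for even counts are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def summarize_rollouts(rollouts):
    """Return aggregate return statistics and the median-return rollout.

    `rollouts` is a list of dicts with keys "return" (float) and "steps"
    (any per-step record). This layout is an assumption for illustration.
    """
    if not rollouts:
        raise ValueError("need at least one rollout")
    ordered = sorted(rollouts, key=lambda r: r["return"])
    returns = [r["return"] for r in ordered]
    # Lower median for even counts, so the result is always a real rollout.
    median_rollout = ordered[(len(ordered) - 1) // 2]
    stats = {
        "n": len(returns),
        "mean": mean(returns),
        "std": pstdev(returns),
        "min": returns[0],
        "max": returns[-1],
    }
    return stats, median_rollout


def build_critic_context(rollouts):
    """Package what a critic would see: aggregates plus one typical trajectory."""
    stats, median_rollout = summarize_rollouts(rollouts)
    return {"stats": stats, "trajectory": median_rollout["steps"]}
```

With 19 successful rollouts and one catastrophic failure, the median rollout is a success, so the single failure shows up only in the `min` and `std` fields rather than as the trajectory placed in front of the critic.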

📝 Abstract

Existing LLM-based policy optimizers see only scalar rewards: that a policy scored 0.45, but not whether the agent got stuck in a loop, fell into a hole on the third step, or performed well on 19 out of 20 rollouts and failed catastrophically on one. We propose Reflective Prompted Policy Optimization (R2PO), a two-stage LLM framework for policy search over compact policy classes that augments scalar reward feedback with trajectory-level behavioral evidence. A Search-LLM proposes candidate policy parameters; the environment executes them; a Critic-LLM inspects the resulting rollouts and proposes targeted revisions grounded in observed states, actions, and rewards. Across ten environments, ablations show R2PO's gains require separating global search from behavior-grounded revision and using selection to filter high-variance edits. We further identify a dominant failure mode, salience bias: when presented with multiple rollouts, the Critic-LLM fixates on improving a single failure even when most trajectories succeed. In a three-trajectory variant where the Critic-LLM sees the best, worst, and median rollout, this behavior explains 76.6% of regressions on CartPole. R2PO mitigates this by reasoning over aggregate rollout statistics, median-trajectory selection, and a revision rule. Using a 20B open-weight model, R2PO achieves the highest mean best reward across all ten environments, reaches near-optimal performance substantially earlier (e.g., near-maximum CartPole reward within ~500 episodes), and trains far more stably than both deep RL and prior LLM-based methods. These results show that treating trajectories as first-class in-context evidence, rather than artifacts reduced to scalar returns, changes how even comparatively small LLMs search over policy spaces, enabling them to learn faster, diagnose more precisely, and reliably improve external controllers.
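The propose, execute, revise, and select cycle described above can be sketched as follows. This is a toy illustration under stated assumptions: `propose`, `evaluate`, and `revise` are hypothetical callables standing in for the Search-LLM, the environment, and the Critic-LLM, and the acceptance rule (keep a revision only if its mean return is not lower) is one plausible reading of "selection", not the rule from the paper.

```python
import random


def mean_return(rollouts):
    return sum(r["return"] for r in rollouts) / len(rollouts)


def r2po_style_loop(propose, evaluate, revise, iterations=20, rollouts_per_eval=3):
    """Two-stage search: global proposal, then trajectory-grounded revision.

    propose(best_params, best_score) -> params      (stand-in for a Search-LLM)
    evaluate(params) -> {"return": float, "steps": list}  (one rollout)
    revise(params, rollouts) -> params              (stand-in for a Critic-LLM)
    """
    best_params, best_score = None, float("-inf")
    for _ in range(iterations):
        candidate = propose(best_params, best_score)
        rollouts = [evaluate(candidate) for _ in range(rollouts_per_eval)]
        score = mean_return(rollouts)

        # The critic sees the rollouts themselves, not only the scalar score.
        revised = revise(candidate, rollouts)
        revised_score = mean_return(
            [evaluate(revised) for _ in range(rollouts_per_eval)]
        )

        # Assumed selection rule: discard revisions that score lower.
        if revised_score >= score:
            candidate, score = revised, revised_score
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


# Toy stand-ins: a 1-D "policy parameter" whose return peaks at 3.0.
def toy_evaluate(p):
    return {"return": -(p - 3.0) ** 2, "steps": [("error", 3.0 - p)]}


def make_toy_propose(seed=0):
    rng = random.Random(seed)
    return lambda best, _score: 0.0 if best is None else best + rng.uniform(-1, 1)


def toy_revise(p, rollouts):
    # Reads a per-step signal from the trajectory and nudges the parameter.
    return p + 0.5 * rollouts[0]["steps"][0][1]
```

Calling `r2po_style_loop(make_toy_propose(), toy_evaluate, toy_revise)` runs the loop on the toy problem. The stubs exist only to make the control flow executable. In a real setup each callable would wrap an LLM prompt or an environment rollout.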

Problem

Research questions and friction points this paper is trying to address.

trajectory-level feedback

salience bias

policy optimization

LLM-based reinforcement learning

behavioral evidence

Innovation

Methods, ideas, or system contributions that make the work stand out.

trajectory-grounded revision

salience bias

LLM-based policy optimization