PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models

📅 2026-05-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
Vision-language-action (VLA) policies are hard to make reliable because manipulation tasks are completed through closed-loop interaction, where each action affects subsequent execution. This work views a VLA policy as both a planner, which makes task-oriented decisions that change the direction of execution, and an executor, which realizes those decisions through dense continuous actions. Building on this view, PAPO-VLA identifies planning actions by jointly considering action variation and trajectory outcome, estimates their importance through causal sufficiency and causal necessity, and incorporates that importance into GRPO advantage estimation so that more important planning actions receive stronger optimization emphasis. Experiments on multiple benchmarks demonstrate the effectiveness of the approach.
📝 Abstract
Vision-Language-Action (VLA) models show promising ability in language-guided robotic tasks. However, making VLA policies reliable remains challenging, because a manipulation task is completed through closed-loop interaction, where each action affects subsequent execution. To analyze this problem, we revisit VLA policy during execution and argue that a VLA policy acts both as a planner, which makes task-oriented decisions that change the direction of execution, and as an executor, which realizes these decisions through dense continuous actions. This view suggests that improving VLA reliability requires particular attention to planning actions. Existing optimization methods can imitate actions or improve complete trajectories, but they usually do not explicitly identify planning actions or measure their importance for task success. To address this issue, we propose Planning-Aware Policy Optimization for VLA models (PAPO-VLA). PAPO-VLA first identifies planning actions by jointly considering action variation and trajectory outcome, then estimates their importance through causal sufficiency and causal necessity, and finally incorporates this importance into GRPO advantage estimation. In this way, more important planning actions receive stronger optimization emphasis, while the whole trajectory is still optimized by trajectory-level feedback. Experiments on multiple benchmarks demonstrate the effectiveness of PAPO-VLA.
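The last step of the abstract, folding planning-action importance into GRPO advantage estimation, can be illustrated with a minimal sketch. It is not the paper's formulation: the abstract does not give the exact formula, so the multiplicative weight `1 + alpha * importance`, the `alpha` parameter, and the function names are assumptions made for illustration. The importance scores (which the paper derives from causal sufficiency and necessity) are taken as given inputs here, with zero meaning "not a planning action".

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one rollout group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def planning_weighted_advantages(rewards, importance, alpha=1.0):
    """Per-step advantages with extra emphasis on important planning actions.

    rewards:    shape (G,), one trajectory-level reward per rollout in the group.
    importance: shape (G, T), hypothetical per-step importance scores in [0, 1].
    alpha:      hypothetical strength of the planning emphasis (assumption).
    """
    adv = grpo_advantages(rewards)                    # shape (G,)
    importance = np.asarray(importance, dtype=float)  # shape (G, T)
    weights = 1.0 + alpha * importance                # weight 1 for ordinary steps
    # Every step keeps the trajectory-level signal; planning steps are scaled up.
    return adv[:, None] * weights                     # shape (G, T)


# Usage: a group of 4 rollouts with 3 steps each; step 1 of rollout 0 is
# treated as a highly important planning action.
rewards = [1.0, 0.0, 1.0, 0.0]
importance = np.zeros((4, 3))
importance[0, 1] = 1.0
step_advantages = planning_weighted_advantages(rewards, importance, alpha=1.0)
```

With all importance scores at zero this reduces to plain group-normalized advantages broadcast over time steps, which matches the abstract's statement that the whole trajectory is still optimized by trajectory-level feedback.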
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
planning actions
policy reliability
robotic manipulation
closed-loop interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Planning-Aware Policy Optimization
Vision-Language-Action Models
Causal Necessity and Sufficiency
GRPO Advantage Estimation
Planning Actions