RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

📅 2026-05-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
Vision-language-action (VLA) models suffer from execution drift in long-horizon, contact-rich manipulation because they are trained only on successful demonstrations, and discarding failed rollouts removes supervision that robustness depends on. This work proposes RePO-VLA, a recovery-driven policy optimization framework that assigns distinct roles to success, recovery, and failure trajectories. It combines Recovery-Aware Initialization (RAI), a Progress-Aware Semantic Value Function (PAS-VF), and Value-Conditioned Refinement (VCR) to turn adverse states into corrective training signals and to steer the policy toward high-progress actions. At deployment the method needs no online failure detector or heuristic retries. The authors also introduce FRBench, a benchmark with standardized error injection and recovery-focused evaluation. Across simulated and real-world bimanual tasks, adversarial success rises from 20% to 75% on average, and up to 80% in scaled real-world trials.
📝 Abstract
Vision-Language-Action (VLA) models remain brittle in long-horizon, contact-rich manipulation because success-only imitation provides little supervision for execution drift, while failed rollouts are often discarded. We introduce RePO-VLA, a recovery-driven policy optimization framework that assigns distinct roles to success, recovery, and failure trajectories. RePO-VLA first applies Recovery-Aware Initialization (RAI), slicing recovery segments and resetting history so corrective actions depend on the current adverse state rather than the preceding failure. It then learns a Progress-Aware Semantic Value Function (PAS-VF), aligning spatiotemporal trajectory features with instructions and successful references. The resulting labels salvage useful failure prefixes via reliability decay, while low-value labels mark drift and terminal breakdowns, teaching differences among nominal, failed, and corrective actions. The data engine turns adverse states into planner-generated or human-collected corrective rollouts, teaching recovery to the success manifold. Value-Conditioned Refinement (VCR) trains the policy to prefer high-progress actions. At deployment, a fixed high value ($v=1.0$) biases actions toward the learned success manifold without online failure detectors or heuristic retries. We introduce FRBench, with standardized error injection and recovery-focused evaluation. Across simulated and real-world bimanual tasks, RePO-VLA improves robustness, raising adversarial success from 20% to 75% on average and up to 80% in scaled real-world trials.
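To make two ideas from the abstract concrete, the sketch below illustrates value labels with reliability decay on a failed rollout and the fixed conditioning value used at deployment. It is an illustration only, not the authors' implementation: the abstract does not give the decay formula, so the rule `1 - decay ** (failure_step - t)`, the per-step `progress` estimates, and the names `decayed_failure_labels`, `policy_input`, and `DEPLOY_VALUE` are all assumptions. Only the value `v = 1.0` comes from the text.

```python
from typing import Dict, List

# Fixed conditioning value at deployment (v = 1.0 in the abstract).
DEPLOY_VALUE = 1.0


def decayed_failure_labels(progress: List[float], failure_step: int,
                           decay: float = 0.5) -> List[float]:
    """Assumed labeling rule for one failed rollout (not from the paper).

    progress[t] is a hypothetical per-step progress estimate in [0, 1].
    Steps at or after failure_step are labeled 0.0 (drift or breakdown).
    Earlier steps keep their progress, scaled by a reliability weight
    1 - decay ** (failure_step - t) that shrinks as failure approaches.
    """
    labels = []
    for t, p in enumerate(progress):
        if t >= failure_step:
            labels.append(0.0)
        else:
            reliability = 1.0 - decay ** (failure_step - t)
            labels.append(p * reliability)
    return labels


def policy_input(observation: str, instruction: str,
                 value: float = DEPLOY_VALUE) -> Dict[str, object]:
    """Bundle the conditioning value with the usual VLA inputs (hypothetical schema)."""
    return {"observation": observation, "instruction": instruction, "value": value}


if __name__ == "__main__":
    # Hypothetical rollout: progress stalls, then the episode fails at step 3.
    print(decayed_failure_labels([0.2, 0.4, 0.6, 0.6], failure_step=3))
    # At deployment the policy is always conditioned on the high value.
    print(policy_input("frame_0", "fold the towel"))
```

Under this assumed rule, the early part of a failed rollout still contributes useful labels, while steps near and after the failure are pushed toward zero. Conditioning on the constant `DEPLOY_VALUE` at test time then asks the policy for actions of the kind it saw paired with high labels, which is why no separate failure detector is needed.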
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
execution drift
failure recovery
long-horizon manipulation
contact-rich tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recovery-Driven Policy Optimization
Vision-Language-Action Models
Progress-Aware Semantic Value Function
Recovery-Aware Initialization
Value-Conditioned Refinement