🤖 AI Summary
This work proposes a closed-loop vision-language-action framework that improves robots' task-progress awareness and error recovery in complex manipulation scenarios. The approach introduces, for the first time, an explicit progress-aware mechanism that continuously monitors the current state relative to the next milestone. It uses a sequence of spatial subgoals and a fallback strategy to enable dynamic task execution and autonomous error correction, without requiring additional training data or auxiliary models. By integrating state observation, language instruction parsing, 2D path planning, and progress monitoring into a unified architecture, the framework achieves a 5% performance gain over MolmoAct on the LIBERO benchmark. On the more challenging LIBERO-Plus benchmark, it demonstrates state-of-the-art out-of-distribution robustness, with the smallest performance degradation among the compared methods.
📝 Abstract
Measuring task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle: Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate the framework's effectiveness, generalization, and robustness: SPR outperforms the MolmoAct baseline by 5% on the LIBERO benchmark. On the challenging LIBERO-Plus benchmark, which features unseen instructions and initial states, SPR achieves state-of-the-art out-of-distribution robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA.
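To make the See, Plan, Rewind cycle concrete, the following is a minimal toy sketch of a progress-monitored control loop on a 2D point robot. It is not the authors' implementation: SPR is a vision-language-action model, whereas this sketch substitutes Euclidean distance to the next waypoint for the learned progress signal and a simple patience counter for stall detection. All names (`run_episode`, `step_toward`, `make_flaky_policy`, `patience`, `tol`) are hypothetical and chosen only for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def step_toward(state, target, step=0.1):
    """Toy 'Plan' stage: move at most `step` toward the target waypoint."""
    d = distance(state, target)
    if d <= step:
        return target
    return (state[0] + step * (target[0] - state[0]) / d,
            state[1] + step * (target[1] - state[1]) / d)


def make_flaky_policy(stall_calls):
    """Hypothetical policy that fails to move on the given call numbers."""
    calls = {"n": 0}

    def act(state, target):
        calls["n"] += 1
        if calls["n"] in stall_calls:
            return state  # simulated execution failure: no motion
        return step_toward(state, target)

    return act


def run_episode(start, milestones, act, tol=1e-3, patience=3, max_steps=200):
    """Run a See / Plan / Rewind loop over an ordered list of 2D milestones.

    Assumption: progress is measured as distance to the next milestone,
    and the 'recoverable state' is the last state at which a milestone
    was reached (or the start state).
    """
    state, checkpoint = start, start
    idx, stalled, events = 0, 0, []
    best = distance(state, milestones[0])

    for _ in range(max_steps):
        if idx == len(milestones):
            break
        target = milestones[idx]

        # See: compare the current state with the upcoming milestone.
        if distance(state, target) <= tol:
            events.append(("reached", idx))
            checkpoint = state
            idx += 1
            stalled = 0
            if idx < len(milestones):
                best = distance(state, milestones[idx])
            continue

        # Plan: take one step toward the next 2D waypoint.
        state = act(state, target)

        # Monitor progress: did the step bring us closer than ever before?
        remaining = distance(state, target)
        if remaining < best - 1e-9:
            best, stalled = remaining, 0
        else:
            stalled += 1

        # Rewind: progress has stalled, so return to the recoverable state.
        if stalled >= patience:
            events.append(("rewind", idx))
            state = checkpoint
            best = distance(state, target)
            stalled = 0

    return idx == len(milestones), events


if __name__ == "__main__":
    policy = make_flaky_policy({2, 3, 4})
    print(run_episode((0.0, 0.0), [(0.3, 0.0), (0.3, 0.2)], policy))
```

In this toy run the policy stalls on three consecutive calls, so the loop rewinds once to the start state before reaching both milestones. The key design point the sketch illustrates is that failure detection and recovery come from the same progress signal that drives normal execution, rather than from a separate failure classifier.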