🤖 AI Summary
This work proposes Anticipation-VLA, a hierarchical vision-language-action (VLA) architecture designed to overcome the limitations of existing VLA models in long-horizon embodied tasks, where error accumulation and fixed-granularity subtask decomposition hinder adaptability to dynamic execution states. Anticipation-VLA introduces a recursive, adaptive anticipation mechanism that dynamically generates future subgoals, coupled with a unified multimodal model (UMM) fine-tuned for high-level subgoal planning. This planning module operates in concert with a goal-conditioned VLA policy responsible for low-level action execution, forming an end-to-end adaptive control framework. Evaluated in both simulated and real-world robotic tasks, the approach significantly outperforms current VLA models, demonstrating that dynamic subgoal generation is crucial for enhancing robustness in long-horizon task execution.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for embodied intelligence, enabling robots to perform tasks based on natural language instructions and current visual input. However, existing VLA models struggle with long-horizon tasks due to compounding errors. Prior methods decompose tasks into subtasks of fixed granularity, which cannot adapt to the varying complexity of execution states and therefore limits their robustness in long-horizon tasks. To overcome this, we introduce the Anticipation Model, which adaptively and recursively generates future subgoals. The model continuously adapts as the task unfolds, adjusting future subgoals in response to evolving dynamics and thereby producing more reliable planning paths. Building on this concept, we propose Anticipation-VLA, a hierarchical VLA model that leverages the anticipation model to generate actionable subgoals that guide VLA policy execution. We implement Anticipation-VLA by fine-tuning a Unified Multimodal Model (UMM) for high-level subgoal generation and pairing it with a goal-conditioned VLA policy for low-level action execution. Experiments in both simulated and real-world robotic tasks demonstrate the effectiveness of Anticipation-VLA, highlighting the importance of adaptive and recursive subgoal generation for robust policy execution.
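The hierarchical loop described above (a high-level planner that recursively refines subgoals, and a goal-conditioned policy that executes toward the nearest one) can be illustrated with a toy sketch. This is not the paper's implementation: the 1-D state, the midpoint-based `propose` rule, the `reachable` test, and the constants `REACH`, `STEP`, and `MAX_DEPTH` are all hypothetical stand-ins chosen only to make the control flow concrete. In the actual system the planner is a fine-tuned UMM and the policy is a VLA model, and the abstract does not specify how the recursion terminates.

```python
# Toy illustration only: every name and constant here is an assumption,
# not taken from the paper.
from typing import List

REACH = 2.0      # hypothetical: how far ahead a subgoal may lie and still be "actionable"
STEP = 1.0       # hypothetical: maximum movement of the low-level policy per step
MAX_DEPTH = 16   # hypothetical: cap on recursive refinement


def reachable(state: float, goal: float) -> bool:
    """Stand-in test for whether the policy can pursue `goal` directly."""
    return abs(goal - state) <= REACH


def propose(state: float, goal: float) -> float:
    """Stand-in for the high-level planner: propose an intermediate subgoal.

    Here it is simply the midpoint between the current state and the goal.
    """
    return (state + goal) / 2.0


def anticipate(state: float, goal: float, depth: int = 0) -> float:
    """Recursively refine the goal until the nearest subgoal is actionable.

    The number of refinements depends on the current state, so subgoal
    granularity adapts rather than being fixed in advance.
    """
    if depth >= MAX_DEPTH or reachable(state, goal):
        return goal
    return anticipate(state, propose(state, goal), depth + 1)


def policy_step(state: float, subgoal: float) -> float:
    """Stand-in for the goal-conditioned low-level policy: move toward the subgoal."""
    delta = subgoal - state
    return state + max(-STEP, min(STEP, delta))


def run_episode(start: float, goal: float, max_steps: int = 100) -> List[float]:
    """Re-plan the subgoal from the current state at every step, then act."""
    state = start
    trace = [state]
    for _ in range(max_steps):
        if abs(goal - state) < 1e-9:
            break
        subgoal = anticipate(state, goal)   # high level: adaptive subgoal
        state = policy_step(state, subgoal)  # low level: act toward it
        trace.append(state)
    return trace


if __name__ == "__main__":
    print(run_episode(0.0, 10.0))
```

Because the subgoal is recomputed from the current state at each step, a disturbance to `state` mid-episode would change the next subgoal without any change to the final goal, which is the adaptive behaviour the abstract emphasizes.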