Anticipation-VLA: Solving Long-Horizon Embodied Tasks via Anticipation-based Subgoal Generation

📅 2026-05-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Anticipation-VLA, a hierarchical vision-language-action (VLA) architecture designed to overcome the limitations of existing VLA models in long-horizon embodied tasks, where error accumulation and fixed-granularity subtask decomposition hinder adaptability to dynamic execution states. Anticipation-VLA introduces a recursive, adaptive anticipation mechanism that dynamically generates future subgoals, coupled with a unified multimodal model (UMM) fine-tuned for high-level subgoal planning. This planning module operates in concert with a goal-conditioned VLA policy responsible for low-level action execution, forming an end-to-end adaptive control framework. Evaluated in both simulated and real-world robotic tasks, the approach significantly outperforms current VLA models, demonstrating that dynamic subgoal generation is crucial for enhancing robustness in long-horizon task execution.
📝 Abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for embodied intelligence, enabling robots to perform tasks based on natural language instructions and current visual input. However, existing VLA models struggle with long-horizon tasks due to compounding errors. Prior methods decompose tasks into subtasks of fixed granularity, which cannot adapt to the varying complexity of execution states, limiting their robustness in long-horizon tasks. To overcome this, we introduce the Anticipation Model, which adaptively and recursively generates future subgoals. This model continuously adapts as the task unfolds, adjusting future subgoals in response to evolving dynamics and thereby producing more reliable plans. Building on this concept, we propose Anticipation-VLA, a hierarchical VLA model that leverages the anticipation model to generate actionable subgoals that guide VLA policy execution. We implement Anticipation-VLA by fine-tuning a Unified Multimodal Model (UMM) for high-level subgoal generation and pairing it with a goal-conditioned VLA policy for low-level action execution. Experiments in both simulated and real-world robotic tasks demonstrate the effectiveness of Anticipation-VLA, highlighting the importance of adaptive and recursive subgoal generation for robust policy execution.
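The two-level loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D state, the midpoint-bisection rule inside `anticipate`, the clipped-step `policy`, and the parameters `reach`, `max_step`, `tol`, and `max_iters` are all assumptions made for illustration. In the paper, subgoals come from a fine-tuned UMM and actions from a goal-conditioned VLA policy. The sketch only shows the control structure: re-plan a subgoal from the current state at every step, refine it recursively until it is close enough to act on, then let the low-level policy move toward it.

```python
from typing import List

State = float  # toy 1-D state; stands in for a visual observation


def anticipate(state: State, goal: State, reach: float) -> State:
    """Recursively refine a subgoal until it lies within `reach` of `state`.

    Hypothetical rule: bisect the gap to the goal. This is a stand-in for
    a learned subgoal generator, not the paper's model.
    """
    if abs(goal - state) <= reach:
        return goal
    midpoint = state + (goal - state) / 2.0
    return anticipate(state, midpoint, reach)


def policy(state: State, subgoal: State, max_step: float) -> float:
    """Goal-conditioned low-level policy: a clipped step toward the subgoal."""
    delta = subgoal - state
    return max(-max_step, min(max_step, delta))


def run_episode(
    start: State,
    goal: State,
    reach: float = 1.0,
    max_step: float = 0.5,
    tol: float = 1e-6,
    max_iters: int = 1000,
) -> List[State]:
    """Alternate high-level subgoal generation and low-level execution."""
    state = start
    trajectory = [state]
    for _ in range(max_iters):
        if abs(goal - state) <= tol:
            break
        # Re-plan from the current state, so the subgoal tracks execution.
        subgoal = anticipate(state, goal, reach)
        state = state + policy(state, subgoal, max_step)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    print(run_episode(0.0, 4.0))
```

Because the subgoal is recomputed from the current state on every iteration, its distance from the state varies with how far the goal still is, rather than following a fixed decomposition chosen up front. That is the property the abstract contrasts with fixed-granularity subtasks.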
Problem

Research questions and friction points this paper is trying to address.

long-horizon tasks
Vision-Language-Action models
subgoal generation
embodied intelligence
compounding errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Anticipation Model
adaptive subgoal generation
recursive planning
hierarchical VLA
long-horizon embodied tasks
Zhilong Zhang
Nanjing University
Reinforcement Learning, Deep Learning
Wenyu Luo
School of Artificial Intelligence, Nanjing University, Nanjing, China
Haonan Wang
PhD Student, School of Computing, National University of Singapore
Machine Learning, Generative AI, Data-Centric AI, Data Mining
Yifei Sheng
School of Artificial Intelligence, Nanjing University, Nanjing, China
Yidi Wang
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China
Hanyuan Guo
School of Artificial Intelligence, Nanjing University, Nanjing, China
Haoxiang Ren
School of Artificial Intelligence, Nanjing University, Nanjing, China
Xinghao Du
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China
Yuhan Che
Department of Foundation model, 2012 Labs, Huawei
Tongtong Cao
Researcher, Huawei Noah's Ark Lab
Robotics, Embodied AI, Autonomous Driving
Lei Yuan
Nanjing University
Machine Learning, Reinforcement Learning, Multi-Agent Systems, Embodied AI
Yang Yu
Professor, Nanjing University
Artificial Intelligence, Reinforcement Learning, Evolutionary Algorithms