Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation

πŸ“… 2025-08-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current vision-language-action (VLA) models perform poorly on long-horizon, multi-step robotic manipulation tasks, primarily because they fail to model phase-wise dependencies among subtasks and to maintain coherence across skill sequences. To address this, we introduce Long-VLA, the first end-to-end VLA model designed for long-horizon tasks, built around a phase-aware input masking mechanism that adaptively segments manipulation stages and dynamically focuses on salient perceptual cues. We also release L-CALVIN, a dedicated benchmark for systematic evaluation of long-horizon robotic control. Our module is architecture-agnostic, integrates seamlessly with mainstream VLA frameworks, and supports both simulated and real-world deployment. Experiments demonstrate substantial improvements over state-of-the-art methods on multi-step manipulation tasks, with clear gains in generalization and control robustness.

πŸ“ Abstract
Vision-Language-Action (VLA) models have become a cornerstone in robotic policy learning, leveraging large-scale multimodal data for robust and scalable control. However, existing VLA frameworks primarily address short-horizon tasks, and their effectiveness on long-horizon, multi-step robotic manipulation remains limited due to challenges in skill chaining and subtask dependencies. In this work, we introduce Long-VLA, the first end-to-end VLA model specifically designed for long-horizon robotic tasks. Our approach features a novel phase-aware input masking strategy that adaptively segments each subtask into moving and interaction phases, enabling the model to focus on phase-relevant sensory cues and enhancing subtask compatibility. This unified strategy preserves the scalability and data efficiency of VLA training, and our architecture-agnostic module can be seamlessly integrated into existing VLA models. We further propose the L-CALVIN benchmark to systematically evaluate long-horizon manipulation. Extensive experiments on both simulated and real-world tasks demonstrate that Long-VLA significantly outperforms prior state-of-the-art methods, establishing a new baseline for long-horizon robotic control.
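As a concrete illustration of the phase-aware input masking strategy described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the gripper-to-target distance test used to segment phases, the 0.05 m threshold, and the `wrist_rgb`/`static_rgb` observation keys are all hypothetical.

```python
import numpy as np

# Assumed threshold for switching from the moving phase to the
# interaction phase; the paper's actual segmentation criterion may differ.
INTERACTION_RADIUS = 0.05  # metres


def detect_phase(gripper_pos: np.ndarray, target_pos: np.ndarray) -> str:
    """Label the current step 'interaction' when the gripper is near the
    target object, and 'moving' otherwise."""
    near = np.linalg.norm(gripper_pos - target_pos) < INTERACTION_RADIUS
    return "interaction" if near else "moving"


def mask_observation(obs: dict, phase: str) -> dict:
    """Zero out phase-irrelevant camera streams before they reach the policy."""
    masked = dict(obs)
    if phase == "moving":
        # Transport phase: assume the global (static) camera suffices and
        # the close-up wrist camera is a distraction, so blank the latter.
        masked["wrist_rgb"] = np.zeros_like(obs["wrist_rgb"])
    else:
        # Interaction phase: assume fine manipulation relies on the wrist
        # camera and proprioception, so blank the global stream instead.
        masked["static_rgb"] = np.zeros_like(obs["static_rgb"])
    return masked
```

Because the mask is applied to the inputs rather than to the network, the same training pipeline and loss can be reused unchanged, which is consistent with the abstract's claim that the strategy preserves the scalability and data efficiency of VLA training.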
Problem

Research questions and friction points this paper is trying to address.

Addressing the limitations of VLA models on long-horizon, multi-step robotic manipulation
Overcoming skill-chaining challenges in vision-language-action models
Improving handling of subtask dependencies across extended task sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Phase-aware input masking strategy
Adaptive segmentation of subtask phases
Architecture-agnostic VLA integration module (see the sketch below)
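To make the architecture-agnostic claim concrete, the sketch below wraps an arbitrary policy with the masking front-end from the earlier snippet. The `act(obs, instruction)` interface and the `gripper_pos`/`target_pos` observation keys are assumptions for illustration, not the paper's API.

```python
class PhaseAwareWrapper:
    """Hypothetical drop-in front-end that adds phase-aware input masking
    to any VLA policy, reusing detect_phase / mask_observation from the
    sketch above. The wrapped policy itself is left untouched."""

    def __init__(self, base_policy):
        # base_policy: any object exposing act(obs, instruction) -> action;
        # this interface is assumed, not taken from the paper.
        self.base_policy = base_policy

    def act(self, obs: dict, instruction: str):
        # Segment the current step into a phase, mask the observation
        # accordingly, then delegate to the unmodified underlying policy.
        phase = detect_phase(obs["gripper_pos"], obs["target_pos"])
        return self.base_policy.act(mask_observation(obs, phase), instruction)
```

Keeping the intervention at the observation level is one simple way to realize the paper's claim that the module integrates seamlessly into existing VLA models.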
πŸ‘₯ Authors
Yiguo Fan, Westlake University
Pengxiang Ding, Zhejiang University. Interests: Human Motion Prediction, Large Language Models, Embodied AI
Shuanghao Bai, Xi'an Jiaotong University (PhD student). Interests: Vision-Language Models, Domain Adaptation, Domain Generalization, Robotic Manipulation
Xinyang Tong, Westlake University
Yuyang Zhu, Westlake University
Hongchao Lu, Zhejiang University
Fengqi Dai, Zhejiang University
Wei Zhao, Zhejiang University
Yang Liu, Zhejiang University
Siteng Huang, Alibaba DAMO Academy | ZJU | Westlake University. Interests: Vision-Language Models, Generative Models, Embodied AI
Zhaoxin Fan, Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing
Badong Chen, Professor, Xi'an Jiaotong University. Interests: signal processing, machine learning, brain-machine interfaces, robotics
Donglin Wang, University of Electronic Science and Technology of China