Long-VLA: Unleashing Long-Horizon Capability of Vision Language Action Model for Robot Manipulation

πŸ“… 2025-08-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current vision-language-action (VLA) models perform poorly on long-horizon, multi-step robotic manipulation tasks, primarily because they fail to model phase-wise dependencies among subtasks and to maintain coherence across skill sequences. To address this, we introduce Long-VLA, the first end-to-end VLA model designed for long-horizon tasks, built around a phase-aware input masking mechanism that adaptively segments manipulation stages and dynamically focuses on salient perceptual cues. We also release L-CALVIN, a dedicated benchmark for systematic evaluation of long-horizon robotic control. Our module is architecture-agnostic, integrates seamlessly with mainstream VLA frameworks, and supports both simulated and real-world deployment. Experiments demonstrate substantial improvements over state-of-the-art methods on multi-step manipulation tasks, with clear gains in generalization and control robustness.

πŸ“ Abstract
Vision-Language-Action (VLA) models have become a cornerstone in robotic policy learning, leveraging large-scale multimodal data for robust and scalable control. However, existing VLA frameworks primarily address short-horizon tasks, and their effectiveness on long-horizon, multi-step robotic manipulation remains limited due to challenges in skill chaining and subtask dependencies. In this work, we introduce Long-VLA, the first end-to-end VLA model specifically designed for long-horizon robotic tasks. Our approach features a novel phase-aware input masking strategy that adaptively segments each subtask into moving and interaction phases, enabling the model to focus on phase-relevant sensory cues and enhancing subtask compatibility. This unified strategy preserves the scalability and data efficiency of VLA training, and our architecture-agnostic module can be seamlessly integrated into existing VLA models. We further propose the L-CALVIN benchmark to systematically evaluate long-horizon manipulation. Extensive experiments on both simulated and real-world tasks demonstrate that Long-VLA significantly outperforms prior state-of-the-art methods, establishing a new baseline for long-horizon robotic control.
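As a concrete illustration of the phase-aware input masking strategy described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the gripper-to-target distance test used to segment phases, the 0.05 m threshold, and the `wrist_rgb`/`static_rgb` observation keys are all hypothetical.

```python
import numpy as np

# Assumed threshold for switching from the moving phase to the
# interaction phase; the paper's actual segmentation criterion may differ.
INTERACTION_RADIUS = 0.05  # metres


def detect_phase(gripper_pos: np.ndarray, target_pos: np.ndarray) -> str:
    """Label the current step 'interaction' when the gripper is near the
    target object, and 'moving' otherwise."""
    near = np.linalg.norm(gripper_pos - target_pos) < INTERACTION_RADIUS
    return "interaction" if near else "moving"


def mask_observation(obs: dict, phase: str) -> dict:
    """Zero out phase-irrelevant camera streams before they reach the policy."""
    masked = dict(obs)
    if phase == "moving":
        # Transport phase: assume the global (static) camera suffices and
        # the close-up wrist camera is a distraction, so blank the latter.
        masked["wrist_rgb"] = np.zeros_like(obs["wrist_rgb"])
    else:
        # Interaction phase: assume fine manipulation relies on the wrist
        # camera and proprioception, so blank the global stream instead.
        masked["static_rgb"] = np.zeros_like(obs["static_rgb"])
    return masked
```

Because the mask is applied to the inputs rather than to the network, the same training pipeline and loss can be reused unchanged, which is consistent with the abstract's claim that the strategy preserves the scalability and data efficiency of VLA training.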
Problem

Research questions and friction points this paper is trying to address.

Addressing the limitations of VLA models on long-horizon, multi-step robotic manipulation
Overcoming skill-chaining challenges in vision-language-action models
Improving handling of subtask dependencies across extended task sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Phase-aware input masking strategy
Adaptive segmentation of subtask phases
Architecture-agnostic VLA integration module (see the sketch below)
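To make the architecture-agnostic claim concrete, the sketch below wraps an arbitrary policy with the masking front-end from the earlier snippet. The `act(obs, instruction)` interface and the `gripper_pos`/`target_pos` observation keys are assumptions for illustration, not the paper's API.

```python
class PhaseAwareWrapper:
    """Hypothetical drop-in front-end that adds phase-aware input masking
    to any VLA policy, reusing detect_phase / mask_observation from the
    sketch above. The wrapped policy itself is left untouched."""

    def __init__(self, base_policy):
        # base_policy: any object exposing act(obs, instruction) -> action;
        # this interface is assumed, not taken from the paper.
        self.base_policy = base_policy

    def act(self, obs: dict, instruction: str):
        # Segment the current step into a phase, mask the observation
        # accordingly, then delegate to the unmodified underlying policy.
        phase = detect_phase(obs["gripper_pos"], obs["target_pos"])
        return self.base_policy.act(mask_observation(obs, phase), instruction)
```

Keeping the intervention at the observation level is one simple way to realize the paper's claim that the module integrates seamlessly into existing VLA models.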
πŸ‘₯ Authors
Yiguo Fan, Westlake University
Pengxiang Ding, Zhejiang University. Interests: Human Motion Prediction, Large Language Models, Embodied AI
Shuanghao Bai, Xi'an Jiaotong University (PhD student). Interests: Vision-Language Models, Domain Adaptation, Domain Generalization, Robotic Manipulation
Xinyang Tong, Westlake University
Yuyang Zhu, Westlake University
Hongchao Lu, Zhejiang University
Fengqi Dai, Zhejiang University
Wei Zhao, Zhejiang University
Yang Liu, Zhejiang University
Siteng Huang, Alibaba DAMO Academy | ZJU | Westlake University. Interests: Vision-Language Models, Generative Models, Embodied AI
Zhaoxin Fan, Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing
Badong Chen, Professor, Xi'an Jiaotong University. Interests: signal processing, machine learning, brain-machine interfaces, robotics
Donglin Wang, University of Electronic Science and Technology of China