🤖 AI Summary
Existing Vision-Language-Action (VLA) models over-rely on global image features, weakening spatial detail preservation and physical-dynamics modeling and thereby limiting performance on fine-grained localization and interactive manipulation tasks. This work introduces UP-VLA, a unified VLA framework that jointly optimizes multi-modal understanding and future prediction. Leveraging a pre-trained vision-language model (VLM) backbone, the method employs multi-task joint pretraining to simultaneously capture high-level semantics, low-level spatial relationships, and physical dynamics, with a composite loss that integrates cross-modal alignment, action sequence prediction, and explicit spatial relation modeling. On the CALVIN ABC-D benchmark, the approach achieves a 33% relative improvement in task success rate over the previous state of the art. Real-world robotic manipulation experiments show substantial gains, particularly on tasks demanding precise spatial localization and dynamic interaction, bridging a key capability gap between static VLMs and embodied agents.
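To make the multi-objective training concrete, below is a minimal sketch of a composite loss combining the three terms the summary names. This is an illustrative PyTorch approximation, not the paper's implementation: the function name, dictionary keys, loss forms (InfoNCE-style alignment, MSE action regression, classification over spatial relation labels), and weights are all assumptions.

```python
import torch
import torch.nn.functional as F

def composite_loss(model_out, batch, w_align=1.0, w_action=1.0, w_spatial=0.5):
    """Hypothetical composite objective: cross-modal alignment +
    action sequence prediction + spatial relation modeling.
    All tensor names and weights are illustrative, not from the paper."""
    # Cross-modal alignment: contrastive (InfoNCE-style) loss between
    # L2-normalized pooled visual and language embeddings, shape (B, D).
    logits = model_out["vis_emb"] @ model_out["txt_emb"].T / 0.07
    targets = torch.arange(logits.size(0), device=logits.device)
    l_align = F.cross_entropy(logits, targets)

    # Action sequence prediction: regress future action chunks
    # against ground-truth trajectories, shape (B, T, action_dim).
    l_action = F.mse_loss(model_out["pred_actions"], batch["actions"])

    # Spatial relation modeling: classify object-centric relations
    # (e.g. left-of, on-top-of) predicted from visual tokens.
    l_spatial = F.cross_entropy(model_out["relation_logits"],
                                batch["relation_labels"])

    return w_align * l_align + w_action * l_action + w_spatial * l_spatial
```

In a joint-pretraining setup like the one described, batches from understanding-oriented and prediction-oriented data sources would contribute different subsets of these terms, so in practice each term would typically be masked out when its supervision is absent.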
📝 Abstract
Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve generalization capabilities. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, limiting their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33% improvement on the CALVIN ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.