🤖 AI Summary
Existing Vision-Language-Action (VLA) models over-rely on global image features, weakening spatial detail preservation and physical-dynamics modeling and thereby limiting performance on fine-grained localization and interactive manipulation tasks. This work introduces UP-VLA, a unified VLA framework that jointly optimizes multi-modal understanding and future prediction. Leveraging a pre-trained vision-language model (VLM) backbone, the method employs multi-task joint pretraining to simultaneously capture high-level semantics, low-level spatial relationships, and physical dynamics, with a composite loss that integrates cross-modal alignment, action sequence prediction, and explicit spatial relation modeling. On the CALVIN ABC-D benchmark, the approach achieves a 33% relative improvement in task success rate over the previous state of the art. Real-world robotic manipulation experiments show substantial gains, particularly on tasks demanding precise spatial localization and dynamic interaction, bridging a key capability gap between static VLMs and embodied agents.
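To make the multi-objective training concrete, below is a minimal sketch of a composite loss combining the three terms the summary names. This is an illustrative PyTorch approximation, not the paper's implementation: the function name, dictionary keys, loss forms (InfoNCE-style alignment, MSE action regression, classification over spatial relation labels), and weights are all assumptions.

```python
import torch
import torch.nn.functional as F

def composite_loss(model_out, batch, w_align=1.0, w_action=1.0, w_spatial=0.5):
    """Hypothetical composite objective: cross-modal alignment +
    action sequence prediction + spatial relation modeling.
    All tensor names and weights are illustrative, not from the paper."""
    # Cross-modal alignment: contrastive (InfoNCE-style) loss between
    # L2-normalized pooled visual and language embeddings, shape (B, D).
    logits = model_out["vis_emb"] @ model_out["txt_emb"].T / 0.07
    targets = torch.arange(logits.size(0), device=logits.device)
    l_align = F.cross_entropy(logits, targets)

    # Action sequence prediction: regress future action chunks
    # against ground-truth trajectories, shape (B, T, action_dim).
    l_action = F.mse_loss(model_out["pred_actions"], batch["actions"])

    # Spatial relation modeling: classify object-centric relations
    # (e.g. left-of, on-top-of) predicted from visual tokens.
    l_spatial = F.cross_entropy(model_out["relation_logits"],
                                batch["relation_labels"])

    return w_align * l_align + w_action * l_action + w_spatial * l_spatial
```

In a joint-pretraining setup like the one described, batches from understanding-oriented and prediction-oriented data sources would contribute different subsets of these terms, so in practice each term would typically be masked out when its supervision is absent.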
📝 Abstract
Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve generalization capabilities. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, limiting their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33% improvement on the CALVIN ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.