UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent

📅 2025-01-31
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Existing Vision-Language-Action (VLA) models over-rely on global image features, which weakens spatial-detail preservation and physical-dynamics modeling and limits performance on fine-grained localization and interactive manipulation tasks. This work introduces a unified VLA framework that jointly optimizes vision-language understanding and future prediction. Leveraging a pre-trained vision-language model (VLM) backbone, the method employs multi-task joint pretraining to simultaneously capture high-level semantics, low-level spatial relationships, and physical dynamics, with a composite objective that integrates cross-modal alignment, action sequence prediction, and explicit spatial relation modeling. On the CALVIN ABC-D benchmark, the approach achieves a 33% improvement in task success rate over the previous state-of-the-art method. Real-world robotic manipulation experiments demonstrate substantial gains, particularly in tasks demanding precise spatial localization and dynamic interaction, bridging a capability gap between static VLMs and embodied agents.
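
As a rough formalization of the composite objective described above, the training loss can be written as a weighted sum of understanding, prediction, and action terms. The symbols and weighting scheme below are illustrative assumptions, not notation from the paper:

```latex
% Illustrative joint objective; the \lambda weights and loss names are
% assumptions, not taken from the paper.
\mathcal{L}_{\mathrm{total}} =
      \lambda_{u}\,\mathcal{L}_{\mathrm{understand}}
    + \lambda_{p}\,\mathcal{L}_{\mathrm{predict}}
    + \lambda_{a}\,\mathcal{L}_{\mathrm{action}}
```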

📝 Abstract
Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve generalization capabilities. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, limiting their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33% improvement on the CALVIN ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.
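
To make the multi-objective training described in the abstract concrete, here is a minimal PyTorch-style sketch of one joint loss computation. All names (joint_loss, the output/target keys, the loss weights) are hypothetical placeholders, not the paper's actual code:

```python
# Hypothetical sketch of a UP-VLA-style joint training loss (PyTorch).
# Output/target keys and loss weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def joint_loss(outputs: dict, targets: dict,
               w_understand: float = 1.0,
               w_predict: float = 1.0,
               w_action: float = 1.0) -> torch.Tensor:
    """Weighted sum of understanding, future-prediction, and action losses."""
    # 1) Multi-modal understanding: token-level loss on VQA/caption answers.
    understand = F.cross_entropy(
        outputs["text_logits"].flatten(0, 1),   # (B*T, vocab)
        targets["text_tokens"].flatten(),       # (B*T,)
        ignore_index=-100,                      # skip prompt/padding tokens
    )
    # 2) Future prediction: classify discrete visual tokens of a future frame.
    predict = F.cross_entropy(
        outputs["future_logits"].flatten(0, 1),
        targets["future_tokens"].flatten(),
    )
    # 3) Action prediction: regression on continuous robot actions.
    action = F.mse_loss(outputs["actions"], targets["actions"])
    return w_understand * understand + w_predict * predict + w_action * action
```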
Problem

Research questions and friction points this paper is trying to address.

VLA models over-reliant on high-level semantics
Loss of precise spatial/position information
Weak attention to low-level visual detail
Innovation

Methods, ideas, or system contributions that make the work stand out.

UP-VLA: unified understanding-and-prediction model
Future prediction objective alongside action learning
Improved capture of precise spatial location information
👤 Authors
Jianke Zhang
Tsinghua University, IIIS
Embodied AI, VLM, Multimodal Learning
Yanjiang Guo
Tsinghua University
Embodied AI, Generative Model
Yucheng Hu
Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
Xiaoyu Chen
Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; Shanghai Qi Zhi Institute, Shanghai, China
Xiang Zhu
Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
Jianyu Chen
Assistant Professor, Tsinghua University
AI, Robotics