Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization

📅 2025-05-21
🤖 AI Summary
Current vision-language-action (VLA) models exhibit unclear zero-shot cross-task generalization capability on unseen manipulation tasks, hindering the development of general-purpose robotic manipulation. This paper addresses this challenge by proposing AGNOSTOS, a rigorous, first-of-its-kind benchmark specifically designed for evaluating zero-shot cross-task generalization across 23 unseen manipulation tasks. We further introduce X-ICM, a novel method integrating dynamics-aware contextual demonstration selection with an action generation paradigm that enables LLM-conditioned guidance, joint vision-language-action modeling, and explicit dynamics alignment. In simulation-based multi-task evaluation, X-ICM achieves an average task success rate 37.2% higher than state-of-the-art VLA models, empirically validating the critical role of dynamics-aware modeling and contextual guidance in enabling robust cross-task generalization.

📝 Abstract
The generalization capabilities of vision-language-action (VLA) models to unseen tasks are crucial to achieving general-purpose robotic manipulation in open-world settings. However, the cross-task generalization capabilities of existing VLA models remain significantly underexplored. To address this gap, we introduce AGNOSTOS, a novel simulation benchmark designed to rigorously evaluate cross-task zero-shot generalization in manipulation. AGNOSTOS comprises 23 unseen manipulation tasks for testing, distinct from common training task distributions, and incorporates two levels of generalization difficulty to assess robustness. Our systematic evaluation reveals that current VLA models, despite being trained on diverse datasets, struggle to generalize effectively to these unseen tasks. To overcome this limitation, we propose Cross-Task In-Context Manipulation (X-ICM), a method that conditions large language models (LLMs) on in-context demonstrations from seen tasks to predict action sequences for unseen tasks. Additionally, we introduce a dynamics-guided sample selection strategy that identifies relevant demonstrations by capturing cross-task dynamics. On AGNOSTOS, X-ICM significantly improves cross-task zero-shot generalization performance over leading VLAs. We believe AGNOSTOS and X-ICM will serve as valuable tools for advancing general-purpose robotic manipulation.
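The two steps the abstract describes (choosing relevant seen-task demonstrations, then conditioning an LLM on them) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the feature vectors are hand-written stand-ins for whatever dynamics representation the authors use, cosine similarity is an assumed relevance score, and the task names, action strings, and the helpers `select_demonstrations` and `build_prompt` are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_demonstrations(query_feature, demos, k=2):
    """Return the k seen-task demos most similar to the query (assumed scoring rule)."""
    ranked = sorted(
        demos,
        key=lambda d: cosine(query_feature, d["feature"]),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(selected, instruction):
    """Lay out the selected demos as in-context examples, ending with the unseen task."""
    lines = []
    for d in selected:
        lines.append(f"Task: {d['instruction']}")
        lines.append(f"Actions: {d['actions']}")
    lines.append(f"Task: {instruction}")
    lines.append("Actions:")
    return "\n".join(lines)

# Hypothetical seen-task demonstrations; features and actions are placeholders.
demos = [
    {"instruction": "close the drawer", "feature": [0.9, 0.1, 0.0],
     "actions": "[[0.4, 0.0, 0.3, 1], [0.2, 0.0, 0.3, 1]]"},
    {"instruction": "pick up the cup", "feature": [0.0, 0.2, 0.9],
     "actions": "[[0.3, 0.1, 0.2, 0], [0.3, 0.1, 0.4, 0]]"},
    {"instruction": "push the button", "feature": [0.7, 0.6, 0.1],
     "actions": "[[0.5, 0.2, 0.3, 1], [0.5, 0.2, 0.1, 1]]"},
]

query_feature = [1.0, 0.2, 0.0]  # placeholder feature for an unseen task
selected = select_demonstrations(query_feature, demos, k=2)
prompt = build_prompt(selected, "close the microwave door")
print(prompt)
```

The resulting prompt would then be passed to an LLM, whose completion is parsed as the action sequence for the unseen task. That call is omitted here because the source does not specify the model or the action format.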
Problem

Research questions and friction points this paper is trying to address.

Evaluating cross-task generalization in vision-language-action models
Addressing poor generalization to unseen manipulation tasks
Proposing methods to improve zero-shot task adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces AGNOSTOS benchmark for cross-task generalization
Proposes X-ICM method using in-context task demonstrations
Uses dynamics-guided strategy for relevant sample selection
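
For a benchmark like the one described, with unseen tasks grouped into two difficulty levels, per-level and overall success rates could be aggregated along these lines. This is a sketch under stated assumptions: the source does not say how scores are averaged, so the unweighted mean over tasks is an assumption, and the level labels, task names, and episode outcomes below are invented placeholders rather than results from the paper.

```python
from collections import defaultdict

def success_rates(episodes):
    """Map each (level, task) pair to its fraction of successful episodes."""
    counts = defaultdict(lambda: [0, 0])  # [successes, trials]
    for level, task, success in episodes:
        counts[(level, task)][0] += int(success)
        counts[(level, task)][1] += 1
    return {key: s / n for key, (s, n) in counts.items()}

def average_by_level(rates):
    """Unweighted mean of per-task success rates within each difficulty level."""
    by_level = defaultdict(list)
    for (level, _task), rate in rates.items():
        by_level[level].append(rate)
    return {level: sum(v) / len(v) for level, v in by_level.items()}

# Placeholder outcomes: (difficulty level, task name, episode succeeded?)
episodes = [
    ("level-1", "task_a", True), ("level-1", "task_a", False),
    ("level-1", "task_b", True), ("level-1", "task_b", True),
    ("level-2", "task_c", False), ("level-2", "task_c", False),
    ("level-2", "task_d", True), ("level-2", "task_d", False),
]

rates = success_rates(episodes)
level_means = average_by_level(rates)
overall = sum(rates.values()) / len(rates)  # mean over all tasks
print(level_means, overall)
```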
Authors
Jiaming Zhou
The Hong Kong University of Science and Technology (Guangzhou)
Ke Ye
The Hong Kong University of Science and Technology (Guangzhou)
Jiayi Liu
The Hong Kong University of Science and Technology (Guangzhou)
Teli Ma
HKUST(GZ) | Shanghai AI Laboratory
Computer Vision, Vision-Language, Robotics
Zifang Wang
The Hong Kong University of Science and Technology (Guangzhou)
Ronghe Qiu
Hong Kong University of Science and Technology
Embodied AI, Mobile Manipulation
Kun-Yu Lin
The University of Hong Kong
Computer Vision, Machine Learning
Zhilin Zhao
Sun Yat-sen University
Junwei Liang
Assistant Professor, HKUST (Guangzhou) | CSE, HKUST | Ph.D. @CMU
Computer Vision, Robotics, Embodied AI, Trajectory Prediction