VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

📅 2024-12-24
🤖 AI Summary
Existing Vision-Language-Action (VLA) benchmarks inadequately assess models’ comprehension of complex linguistic instructions and long-horizon reasoning—particularly lacking systematic evaluation of commonsense transfer, implicit intent recognition, multi-step logical planning, and cross-modal coordination. Method: We introduce VLABench, the first language-conditioned robotic manipulation benchmark explicitly designed for long-horizon reasoning. It comprises 100 highly randomized tasks and 2,000+ diverse objects, uniquely integrating world-knowledge transfer, non-templated natural language instructions, and temporally extended multi-step reasoning. Built upon the VLA paradigm, it leverages automated data collection, high-fidelity simulation, and multi-dimensional joint annotation. Contribution/Results: VLABench enables dual-track evaluation—action policy execution and language understanding—and provides a rigorous, scalable benchmark alongside high-quality fine-tuning data. Experiments reveal substantial performance gaps for current SOTA pre-trained VLAs and VLM-driven pipelines, highlighting critical limitations in embodied reasoning capabilities.

📝 Abstract
General-purpose embodied agents are designed to understand users' natural instructions or intentions and act precisely to complete universal tasks. Recently, methods based on foundation models, especially Vision-Language-Action models (VLAs), have shown substantial potential for solving language-conditioned manipulation (LCM) tasks. However, existing benchmarks do not adequately meet the needs of VLAs and related algorithms. To better define such general-purpose tasks in the context of LLMs and advance research in VLAs, we present VLABench, an open-source benchmark for evaluating universal LCM task learning. VLABench provides 100 carefully designed task categories, with strong randomization within each category and a total of 2000+ objects. VLABench stands out from previous benchmarks in four key aspects: 1) tasks requiring world knowledge and common-sense transfer, 2) natural language instructions with implicit human intentions rather than templates, 3) long-horizon tasks demanding multi-step reasoning, and 4) evaluation of both action policies and language model capabilities. The benchmark assesses multiple competencies, including understanding of mesh and texture, spatial relationships, semantic instructions, physical laws, knowledge transfer, and reasoning. To support downstream fine-tuning, we provide high-quality training data collected via an automated framework incorporating heuristic skills and prior information. Experimental results indicate that both current state-of-the-art pretrained VLAs and VLM-based workflows struggle with our tasks.
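The dual-track evaluation described above (scoring both action execution and language understanding over per-episode randomized tasks) can be illustrated with a minimal, self-contained sketch. All class, function, and object names below are hypothetical stand-ins and are not the actual VLABench API; this only mirrors the structure the abstract describes.

```python
import random
from dataclasses import dataclass, field

# Hypothetical sketch of a language-conditioned manipulation (LCM) episode
# with per-episode randomization, loosely mirroring the benchmark's design.
# None of these names come from the real VLABench codebase.

@dataclass
class TaskEpisode:
    instruction: str       # natural-language goal, possibly with implicit intent
    target_object: str     # object the policy must manipulate
    distractors: list = field(default_factory=list)  # re-sampled each episode

def sample_episode(rng: random.Random) -> TaskEpisode:
    """Randomize the scene per episode, as VLABench randomizes each task."""
    objects = ["mug", "apple", "knife", "sponge", "book"]
    target = rng.choice(objects)
    distractors = [o for o in objects if o != target]
    # Implicit-intent instruction: the target may be named only indirectly.
    if target == "mug":
        instruction = "I'm thirsty -- bring me something to drink from."
    else:
        instruction = f"Please hand me the {target}."
    return TaskEpisode(instruction, target, distractors)

def evaluate(policy, episodes) -> dict:
    """Dual-track scoring: language (target identification) and action success."""
    action_ok = lang_ok = 0
    for ep in episodes:
        predicted_target, action_succeeded = policy(ep)
        lang_ok += predicted_target == ep.target_object
        action_ok += action_succeeded
    n = len(episodes)
    return {"language_acc": lang_ok / n, "action_success_rate": action_ok / n}

def oracle_policy(ep: TaskEpisode):
    """Trivial oracle used only to exercise the harness."""
    return ep.target_object, True

rng = random.Random(0)
episodes = [sample_episode(rng) for _ in range(10)]
scores = evaluate(oracle_policy, episodes)
print(scores)  # the oracle scores 1.0 on both tracks
```

A real policy would return its grounded target and a simulator-judged success flag in place of the oracle; the point is only that the two tracks are scored independently, so a policy can identify the right object yet still fail the manipulation.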
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action Models
Comprehensive Evaluation
Complex Language Instructions
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLABench
Vision-Language-Action Models
Complex Task Execution
Shiduo Zhang
Fudan University
Embodied AI · Foundation Models
Zhe Xu
School of Computer Science, Fudan University
Peiju Liu
School of Computer Science, Fudan University
Xiaopeng Yu
School of Computer Science, Fudan University
Yuan Li
School of Computer Science, Fudan University
Qinghui Gao
School of Computer Science, Fudan University
Zhaoye Fei
Fudan University
Natural Language Processing
Zhangyue Yin
School of Computer Science, Fudan University
Zuxuan Wu
Fudan University
Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis · Embodied AI · Trustworthy AI
Xipeng Qiu
School of Computer Science, Fudan University