PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) lack a fine-grained understanding of physical preconditions (such as object properties, action affordances, and physical constraints), which hinders their reliable deployment in robotic manipulation. To address this, we introduce PAC Bench, a large-scale benchmark specifically designed to assess VLMs' ability to reason about the preconditions of robotic manipulation. PAC Bench combines real-world images, humanoid-perspective scenes, and simulated constraint scenarios, built on over 30,000 manually annotated instances, and systematically evaluates VLMs' physical commonsense reasoning from a task-executability perspective. Empirical results reveal significant deficiencies in mainstream VLMs' comprehension of fundamental physical concepts, underscoring their limitations for real-world robot control. PAC Bench thus establishes a standardized, scalable, and physically grounded evaluation framework to advance research on physics-aware vision-language understanding for robotics.

📝 Abstract
Vision-Language Models (VLMs) are increasingly pivotal for generalist robot manipulation, enabling tasks such as physical reasoning, policy generation, and failure detection. However, their proficiency in these high-level applications often assumes a deep understanding of low-level physical prerequisites, a capability that remains largely unverified. For robots to perform actions reliably, they must comprehend intrinsic object properties (e.g., material, weight), action affordances (e.g., graspable, stackable), and physical constraints (e.g., stability, reachability, or an object's state, such as being closed). Despite the widespread use of VLMs in manipulation tasks, we argue that off-the-shelf models may lack this granular, physically grounded understanding, as such prerequisites are often overlooked during training. To address this critical gap, we introduce PAC Bench, a comprehensive benchmark designed to systematically evaluate VLMs on their understanding of core Properties, Affordances, and Constraints (PAC) from a task executability perspective. PAC Bench features a diverse dataset with over 30,000 annotations, comprising 673 real-world images (115 object classes, 15 property types, and 1 to 3 affordances defined per class), 100 real-world humanoid-view scenarios, and 120 unique simulated constraint scenarios across four tasks. Our evaluations reveal significant gaps in the ability of current VLMs to grasp fundamental physical concepts, highlighting limitations in their suitability for reliable robot manipulation and pointing to key areas for targeted research. PAC Bench also serves as a standardized benchmark for rigorously evaluating physical reasoning in VLMs and guiding the development of more robust, physically grounded models for robotic applications. Project Page: https://pacbench.github.io/
Problem

Research questions and friction points this paper is trying to address.

Assessing VLMs' understanding of low-level physical prerequisites for manipulation.
Evaluating VLMs' grasp of object properties, affordances, and constraints.
Identifying gaps in VLMs' physical reasoning for reliable robot tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces PAC Bench for evaluating VLMs' physical precondition understanding
Evaluates Properties, Affordances, and Constraints (PAC) from a task-executability perspective
Uses a diverse dataset with over 30,000 annotations across real-world and simulated scenes
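To make the evaluation setup concrete, here is a minimal sketch of how a PAC-style benchmark loop could score a model on precondition questions. This is not the authors' code: `PACInstance`, `evaluate`, and `query_vlm` are hypothetical names, assuming each instance pairs an image with a property, affordance, or constraint question and a ground-truth answer.

```python
# Hypothetical sketch of a PAC-style evaluation loop (not the authors' code).
# Assumption: each instance pairs a scene image with one precondition
# question (property / affordance / constraint) and a ground-truth answer.
from dataclasses import dataclass


@dataclass
class PACInstance:
    image_path: str  # scene image (real-world, humanoid-view, or simulated)
    category: str    # "property" | "affordance" | "constraint"
    question: str    # e.g. "Is the drawer closed?"
    answer: str      # ground-truth label, e.g. "yes"


def evaluate(instances, query_vlm):
    """Compute per-category accuracy.

    `query_vlm` is any callable mapping (image_path, question) to a
    predicted answer string; it would wrap the VLM under test.
    """
    correct, total = {}, {}
    for inst in instances:
        pred = query_vlm(inst.image_path, inst.question)
        total[inst.category] = total.get(inst.category, 0) + 1
        if pred.strip().lower() == inst.answer.strip().lower():
            correct[inst.category] = correct.get(inst.category, 0) + 1
    return {cat: correct.get(cat, 0) / n for cat, n in total.items()}
```

Reporting accuracy per category (rather than one aggregate score) mirrors the benchmark's split into properties, affordances, and constraints, so weaknesses in one kind of precondition reasoning are not masked by strengths in another.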