🤖 AI Summary
Multimodal large language models (MLLMs) exhibit critical deficiencies in physical tool cognition—specifically in tool recognition, understanding, and creation—posing fundamental challenges for embodied intelligence.
Method: We introduce PhysToolBench, the first benchmark explicitly designed to evaluate these three hierarchical capabilities in MLLMs, comprising over 1,000 image–text pairs. We propose a novel “tool creation” task demanding both causal principle reasoning and compositional generalization, and systematically assess 32 state-of-the-art MLLMs—including closed-source, open-source, and embodied-specialized variants.
Contribution/Results: Our evaluation reveals severe limitations in MLLMs’ grasp of tool mechanics and creative application, highlighting bottlenecks in physical causal reasoning and embodiment-aware adaptation. PhysToolBench fills a key gap in the quantitative assessment of tool cognition; we publicly release the dataset, annotations, and evaluation code to establish a reproducible benchmark and an actionable roadmap for advancing tool learning in embodied AI.
📝 Abstract
The ability to use, understand, and create tools is a hallmark of human intelligence, enabling sophisticated interaction with the physical world. For any general-purpose intelligent agent to achieve true versatility, it must also master these fundamental skills. While modern Multimodal Large Language Models (MLLMs) leverage their extensive common knowledge for high-level planning in embodied AI and in downstream Vision-Language-Action (VLA) models, the extent of their true understanding of physical tools remains unquantified. To bridge this gap, we present PhysToolBench, the first benchmark dedicated to evaluating the comprehension of physical tools by MLLMs. Our benchmark is structured as a Visual Question Answering (VQA) dataset comprising over 1,000 image–text pairs. It assesses capabilities across three distinct difficulty levels: (1) Tool Recognition: requiring recognition of a tool's primary function; (2) Tool Understanding: testing the ability to grasp the underlying principles of a tool's operation; (3) Tool Creation: challenging the model to fashion a new tool from surrounding objects when conventional options are unavailable. Our comprehensive evaluation of 32 MLLMs, spanning proprietary models, open-source models, specialized embodied models, and the backbones of VLAs, reveals a significant deficiency in tool understanding. Furthermore, we provide an in-depth analysis and propose preliminary solutions. Code and dataset are publicly available.
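The three-level VQA evaluation described above can be sketched as a simple per-level accuracy harness. This is a minimal illustrative sketch, not the released evaluation code: the item fields, level names, and `model(image, question) -> answer` interface are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List

class Level(Enum):
    RECOGNITION = 1    # identify a tool's primary function
    UNDERSTANDING = 2  # reason about how the tool works
    CREATION = 3       # improvise a tool from surrounding objects

@dataclass
class VQAItem:
    image_path: str  # path to the benchmark image
    question: str    # question text (e.g., multiple-choice prompt)
    answer: str      # ground-truth option label, e.g. "B"
    level: Level     # difficulty tier of this item

def evaluate(items: List[VQAItem],
             model: Callable[[str, str], str]) -> Dict[Level, float]:
    """Compute exact-match accuracy per difficulty level.

    `model` is any callable taking (image_path, question) and
    returning a predicted option label.
    """
    correct = {lvl: 0 for lvl in Level}
    total = {lvl: 0 for lvl in Level}
    for item in items:
        total[item.level] += 1
        pred = model(item.image_path, item.question).strip().upper()
        if pred == item.answer.upper():
            correct[item.level] += 1
    # Report accuracy only for levels that actually have items.
    return {lvl: correct[lvl] / total[lvl] for lvl in Level if total[lvl]}
```

A harness like this makes the paper's headline comparison concrete: a model may score well on Recognition items while its Understanding and Creation accuracies expose the deficiency the evaluation reports.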