🤖 AI Summary
Existing evaluation frameworks predominantly focus on single-arm manipulation and are ill-equipped to assess the spatiotemporal coordination capabilities of multimodal large language models in bimanual collaborative tasks. To address this gap, this work proposes the first three-tiered hierarchical benchmark specifically designed for bimanual coordination, encompassing spatial reasoning, motion planning, and end-effector control. The framework explicitly distinguishes between perceptual hallucinations and planning failures while incorporating inter-arm kinematic constraints. Comprehensive evaluation of over thirty state-of-the-art models reveals that, despite strong high-level reasoning abilities, they consistently exhibit significant deficiencies in bimanual spatial localization, temporal coordination, and mutual collision avoidance. This benchmark provides a systematic tool for evaluating and advancing intelligent agents in multi-arm collaborative settings.
📝 Abstract
Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks such as lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark that evaluates MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates uniquely bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. An analysis of over 30 state-of-the-art models reveals that, despite proficiency in high-level reasoning, MLLMs struggle with dual-arm spatial grounding and control, frequently producing mutual interference and sequencing errors. These findings suggest that the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research on inter-arm collision avoidance and fine-grained temporal sequencing.
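To make the tiered structure concrete, here is a purely illustrative sketch of how a three-tier evaluation harness of the kind the abstract describes might aggregate per-tier scores. The tier names follow the abstract; everything else (the `evaluate()` function, `TierResult`, and the task format) is an assumption for illustration, not BiManiBench's actual API.

```python
# Illustrative sketch only: a minimal per-tier accuracy harness.
# Tier names mirror the abstract; the rest is invented for this example.
from dataclasses import dataclass

TIERS = ("spatial_reasoning", "action_planning", "end_effector_control")

@dataclass
class TierResult:
    correct: int = 0
    total: int = 0

    @property
    def accuracy(self) -> float:
        # Empty tiers score 0.0 rather than dividing by zero.
        return self.correct / self.total if self.total else 0.0

def evaluate(model, tasks):
    """Score `model` on (tier, prompt, gold_answer) triples, bucketed by tier."""
    results = {t: TierResult() for t in TIERS}
    for tier, prompt, gold in tasks:
        results[tier].total += 1
        results[tier].correct += int(model(prompt) == gold)
    return {t: r.accuracy for t, r in results.items()}
```

Reporting accuracy per tier rather than a single aggregate is what lets a benchmark like this separate perceptual failures (low spatial-reasoning scores) from planning or control failures, as the abstract emphasizes.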