🤖 AI Summary
Current vision-language models (VLMs) excel on coarse-grained spatial relation benchmarks (e.g., "left of", "behind") but lack the fine-grained spatial understanding essential for object interaction: precise 3D localization, physical compatibility between objects, affordance modeling, and multi-step spatial planning. To address this gap, we introduce BOP-ASK, a large-scale multi-task dataset and benchmark for object-interaction reasoning. It spans six tasks, four of them novel, including grasp pose generation, fine-grained relative spatial relation recognition, depth-aware spatial reasoning, and multi-step trajectory planning. Leveraging 6D object poses from the BOP datasets, our pipeline automatically derives rich annotations, including grasp poses, referred object poses, path-planning trajectories, and relative spatial and depth relationships. We also release BOP-ASK-lab, an out-of-distribution test set with images not sourced from BOP, enabling evaluation of generalization. Experiments demonstrate that models trained on BOP-ASK achieve substantial gains in fine-grained spatial reasoning over strong baselines and exhibit strong generalization and emergent capabilities.
📝 Abstract
Vision-Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path-planning trajectories, relative spatial and depth relationships, and object-to-object relationships. BOP-ASK comprises over 150k images and 33M question-answer pairs spanning six tasks (four novel), providing a rich resource for training and evaluating VLMs. We evaluate proprietary and open-source VLMs, and conduct human evaluations on BOP-ASK-core, a contributed test benchmark. We also release BOP-ASK-lab, an out-of-distribution benchmark with images not sourced from BOP, enabling testing of generalization. Our experiments demonstrate that models trained on BOP-ASK outperform baselines and exhibit emergent capabilities such as precise object and grasp pose estimation, trajectory planning, and fine-grained object-centric spatial reasoning in cluttered environments. We will publicly release our datasets and dataset generation pipeline.