🤖 AI Summary
Existing vision-language models (VLMs) achieve strong performance on visual question answering (VQA), yet their capacity to abstract and reason about object properties remains poorly understood, largely due to limitations of current benchmarks: narrow attribute coverage (e.g., size only), conflation of perception and reasoning, low reasoning complexity, and limited image diversity. Method: the authors introduce ORBIT, a systematic benchmark for object property reasoning featuring three representative image types, four property dimensions (physical, functional, spatial, and counting), and three reasoning levels of increasing complexity, with perception explicitly decoupled from reasoning. Grounded in embodied cognition and commonsense theories, ORBIT employs controllable generation and structured templates, augmented with counterfactual reasoning challenges about physical and functional properties. Contribution/Results: zero-shot evaluation of 12 state-of-the-art VLMs reveals a best-model accuracy of only 40%, substantially below human performance, exposing fundamental bottlenecks in understanding realistic (photographic) images, fine-grained counting, and multi-step abstract reasoning.
📝 Abstract
While vision-language models (VLMs) have made remarkable progress on many popular visual question answering (VQA) benchmarks, it remains unclear whether they abstract and reason over depicted objects. Inspired by human object categorization, object property reasoning involves identifying and recognizing both low-level details and higher-level abstractions. Current VQA benchmarks consider only a limited set of object property attributes, such as size; they typically blend perception and reasoning, and lack representativeness in terms of reasoning levels and image categories. To this end, we introduce a systematic evaluation framework with images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions driven by prior work on commonsense reasoning. We develop a procedure to instantiate this framework into ORBIT, a multi-level reasoning VQA benchmark for object properties comprising 360 images paired with a total of 1,080 count-based questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations compared to humans, with the best-performing model reaching only 40% accuracy. VLMs struggle particularly with realistic (photographic) images, counterfactual reasoning about physical and functional properties, and higher counts. ORBIT points to the need to develop scalable benchmarking methods, generalize annotation guidelines, and evaluate additional reasoning-oriented VLMs. We make the ORBIT benchmark and the experimental code available to support such endeavors.
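As a concrete illustration of the zero-shot evaluation described above, accuracy on count-based questions can be scored by exact match after normalizing the model's free-form answer. The sketch below is an illustrative assumption, not ORBIT's actual scoring code; the normalization rules and function names are hypothetical:

```python
# Minimal sketch of exact-match accuracy for count-based VQA answers.
# Normalization rules here (number-word mapping, punctuation stripping)
# are assumptions for illustration, not the benchmark's implementation.
_NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}

def normalize_count(answer: str) -> str:
    """Lowercase, strip surrounding whitespace/periods, map number words to digits."""
    token = answer.strip().lower().rstrip(".")
    return _NUMBER_WORDS.get(token, token)

def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that exactly match gold after normalization."""
    assert len(predictions) == len(gold), "prediction/gold length mismatch"
    hits = sum(
        normalize_count(p) == normalize_count(g)
        for p, g in zip(predictions, gold)
    )
    return hits / len(gold)
```

For example, `accuracy(["Three", "5"], ["3", "four"])` scores the first answer correct ("three" maps to "3") and the second incorrect, yielding 0.5. Real evaluations would also need to handle answers embedded in longer generated sentences.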