🤖 AI Summary
Current vision-language models show significant limitations in hierarchical scene reasoning, such as inferring task-relevant properties (material, function, physical attributes) from object identity and spatial relations, while prevailing benchmarks emphasize surface-level recognition or image-text alignment and lack systematic evaluation of this compositional reasoning. Method: We propose Perceptual Taxonomy, a hierarchical framework for scene understanding, and introduce a benchmark for physically grounded visual reasoning: built on 3,173 objects annotated with 84 fine-grained attributes, it comprises roughly 28,000 multiple-choice questions spanning synthetic and real-world scenes. Contribution/Results: Experiments show that mainstream models degrade by 10–20% on property-driven tasks; providing in-context reasoning examples from simulated scenes substantially improves their accuracy on both real-world and expert-crafted questions. These findings expose a critical structural-reasoning bottleneck and point toward more human-like perceptual reasoning in vision-language systems.
📝 Abstract
We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties such as material, affordance, function, and physical attributes to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evaluation of this ability and instead focus on surface-level recognition or image-text alignment.
To address this gap, we introduce Perceptual Taxonomy, a benchmark for physically grounded visual reasoning. We annotate 3,173 objects with four property families covering 84 fine-grained attributes. Using these annotations, we construct a multiple-choice question benchmark with 5,802 images across both synthetic and real domains. The benchmark contains 28,033 template-based questions spanning four types (object description, spatial reasoning, property matching, and taxonomy reasoning), along with 50 expert-crafted questions designed to evaluate models across the full spectrum of perceptual taxonomy reasoning.
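The annotation and question structure described above can be sketched roughly as follows; all field names, category labels, and the example question are illustrative assumptions, not the benchmark's released schema.

```python
# Illustrative sketch only: field names and example values are assumptions,
# not the benchmark's actual data format.
from dataclasses import dataclass, field

@dataclass
class ObjectAnnotation:
    object_id: str
    category: str                                         # e.g. "ceramic mug"
    # Four property families, each holding fine-grained attribute labels
    material: list[str] = field(default_factory=list)     # e.g. ["ceramic"]
    affordance: list[str] = field(default_factory=list)   # e.g. ["graspable", "can contain liquid"]
    function: list[str] = field(default_factory=list)     # e.g. ["drinking"]
    physical: list[str] = field(default_factory=list)     # e.g. ["rigid", "brittle"]

@dataclass
class BenchmarkQuestion:
    image_id: str
    question_type: str   # "object_description" | "spatial_reasoning" |
                         # "property_matching" | "taxonomy_reasoning"
    question: str
    choices: list[str]   # multiple-choice options
    answer_index: int

# Hypothetical property-matching question
q = BenchmarkQuestion(
    image_id="synthetic_00042",
    question_type="property_matching",
    question="Which object on the table could safely hold boiling water?",
    choices=["paper cup", "ceramic mug", "plastic bag", "wooden spoon"],
    answer_index=1,
)
```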
Experimental results show that leading vision-language models perform well on recognition tasks but degrade by 10 to 20 percent on property-driven questions, especially those requiring multi-step reasoning over structured attributes. These findings highlight a persistent gap in structured visual understanding and the limitations of current models that rely heavily on pattern matching. We also show that providing in-context reasoning examples from simulated scenes improves performance on real-world and expert-crafted questions, demonstrating the effectiveness of perceptual-taxonomy-guided prompting.
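As a rough illustration of perceptual-taxonomy-guided prompting, the sketch below prepends a worked, simulation-style reasoning example before the target question. The prompt wording, the `build_prompt` helper, and the in-context example are hypothetical and not the exact prompts used in the paper.

```python
# Minimal sketch of perceptual-taxonomy-guided prompting; the in-context
# example and instructions are illustrative assumptions.

IN_CONTEXT_EXAMPLE = """\
Scene (simulated): a kitchen counter with a glass jar, a steel knife, and a sponge.
Step 1 - Objects and layout: the jar is left of the knife; the sponge is behind both.
Step 2 - Properties: the jar is rigid, transparent, and can contain liquid;
the knife is metallic and sharp; the sponge is soft and absorbent.
Step 3 - Answer: to soak up a spill, choose the sponge (absorbent).
"""

def build_prompt(question: str, choices: list[str]) -> str:
    """Prepend a taxonomy-style worked example, then pose the target question."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        "Reason in three steps: identify objects and spatial layout, infer their "
        "material/affordance/function/physical properties, then answer.\n\n"
        f"Example:\n{IN_CONTEXT_EXAMPLE}\n"
        f"Question: {question}\n{options}\n"
        "Answer with a single option letter."
    )

print(build_prompt(
    "Which object would shatter if dropped on a tile floor?",
    ["rubber ball", "glass jar", "steel knife", "sponge"],
))
```

The three-step scaffold mirrors the taxonomy order described in the abstract: recognize objects and their spatial layout first, then infer task-relevant properties, and only then answer.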