🤖 AI Summary
Current vision-language models show significant limitations in hierarchical scene reasoning, such as inferring task-relevant properties (material, function, physical attributes) from object identity and spatial relations, while prevailing benchmarks emphasize surface-level recognition or image-text alignment and lack systematic evaluation of this compositional reasoning. Method: We propose Perceptual Taxonomy, a hierarchical framework for scene understanding, and introduce a benchmark for physically grounded visual reasoning: built on 3,173 objects annotated with 84 fine-grained attributes, it comprises roughly 28,000 multiple-choice questions spanning synthetic and real-world scenes. Contribution/Results: Experiments show that mainstream models degrade by 10–20% on property-driven tasks; providing in-context reasoning examples from simulated scenes substantially improves their accuracy on both real-world and expert-crafted questions. These findings expose a critical structural-reasoning bottleneck and point toward more human-like perceptual reasoning in vision-language systems.
📝 Abstract
We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties such as material, affordance, function, and physical attributes to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evaluation of this ability and instead focus on surface-level recognition or image-text alignment.
To address this gap, we introduce Perceptual Taxonomy, a benchmark for physically grounded visual reasoning. We annotate 3,173 objects with four property families covering 84 fine-grained attributes. Using these annotations, we construct a multiple-choice question benchmark with 5,802 images across both synthetic and real domains. The benchmark contains 28,033 template-based questions spanning four types (object description, spatial reasoning, property matching, and taxonomy reasoning), along with 50 expert-crafted questions designed to evaluate models across the full spectrum of perceptual taxonomy reasoning.
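The annotation and question structure described above can be sketched roughly as follows; all field names, category labels, and the example question are illustrative assumptions, not the benchmark's released schema.

```python
# Illustrative sketch only: field names and example values are assumptions,
# not the benchmark's actual data format.
from dataclasses import dataclass, field

@dataclass
class ObjectAnnotation:
    object_id: str
    category: str                                         # e.g. "ceramic mug"
    # Four property families, each holding fine-grained attribute labels
    material: list[str] = field(default_factory=list)     # e.g. ["ceramic"]
    affordance: list[str] = field(default_factory=list)   # e.g. ["graspable", "can contain liquid"]
    function: list[str] = field(default_factory=list)     # e.g. ["drinking"]
    physical: list[str] = field(default_factory=list)     # e.g. ["rigid", "brittle"]

@dataclass
class BenchmarkQuestion:
    image_id: str
    question_type: str   # "object_description" | "spatial_reasoning" |
                         # "property_matching" | "taxonomy_reasoning"
    question: str
    choices: list[str]   # multiple-choice options
    answer_index: int

# Hypothetical property-matching question
q = BenchmarkQuestion(
    image_id="synthetic_00042",
    question_type="property_matching",
    question="Which object on the table could safely hold boiling water?",
    choices=["paper cup", "ceramic mug", "plastic bag", "wooden spoon"],
    answer_index=1,
)
```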
Experimental results show that leading vision-language models perform well on recognition tasks but degrade by 10 to 20 percent on property-driven questions, especially those requiring multi-step reasoning over structured attributes. These findings highlight a persistent gap in structured visual understanding and the limitations of current models that rely heavily on pattern matching. We also show that providing in-context reasoning examples from simulated scenes improves performance on real-world and expert-crafted questions, demonstrating the effectiveness of perceptual-taxonomy-guided prompting.
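As a rough illustration of perceptual-taxonomy-guided prompting, the sketch below prepends a worked, simulation-style reasoning example before the target question. The prompt wording, the `build_prompt` helper, and the in-context example are hypothetical and not the exact prompts used in the paper.

```python
# Minimal sketch of perceptual-taxonomy-guided prompting; the in-context
# example and instructions are illustrative assumptions.

IN_CONTEXT_EXAMPLE = """\
Scene (simulated): a kitchen counter with a glass jar, a steel knife, and a sponge.
Step 1 - Objects and layout: the jar is left of the knife; the sponge is behind both.
Step 2 - Properties: the jar is rigid, transparent, and can contain liquid;
the knife is metallic and sharp; the sponge is soft and absorbent.
Step 3 - Answer: to soak up a spill, choose the sponge (absorbent).
"""

def build_prompt(question: str, choices: list[str]) -> str:
    """Prepend a taxonomy-style worked example, then pose the target question."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return (
        "Reason in three steps: identify objects and spatial layout, infer their "
        "material/affordance/function/physical properties, then answer.\n\n"
        f"Example:\n{IN_CONTEXT_EXAMPLE}\n"
        f"Question: {question}\n{options}\n"
        "Answer with a single option letter."
    )

print(build_prompt(
    "Which object would shatter if dropped on a tile floor?",
    ["rubber ball", "glass jar", "steel knife", "sponge"],
))
```

The three-step scaffold mirrors the taxonomy order described in the abstract: recognize objects and their spatial layout first, then infer task-relevant properties, and only then answer.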