IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

📅 2025-06-29
🤖 AI Summary
This work asks whether vision-language models (VLMs) genuinely understand the scenes they observe. To answer it, we introduce IR3D-Bench, a benchmark that departs from conventional descriptive evaluation in favor of an “understanding through creation” paradigm: models must reconstruct the 3D structure underlying an image by invoking programming interfaces and 3D renderers (differentiable or non-differentiable). The method follows an analysis-by-synthesis framework that combines procedural scene generation, renderer invocation, and multi-dimensional quantitative metrics to assess geometric, spatial, and appearance fidelity. Experiments reveal significant limitations in the 3D reconstruction accuracy of current state-of-the-art VLMs, chiefly in visual precision rather than basic tool usage. IR3D-Bench is presented as the first VLM evaluation benchmark to combine embodied tool use with generative inverse rendering, offering a reproducible, decomposable, and extensible benchmark of 3D cognition for real-world scene understanding.

📝 Abstract
Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.
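
To make the "understanding-by-creating" setup concrete, here is a minimal sketch of the first step of such a pipeline: turning an agent's textual reply into a structured scene that a renderer could consume. The JSON schema, the `SceneObject` fields, and the `parse_scene` helper are illustrative assumptions, not the format IR3D-Bench actually uses.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SceneObject:
    """One object in a hypothetical scene description (fields are assumed)."""
    shape: str
    color: str
    size: float
    position: Tuple[float, float, float]


def parse_scene(raw: str) -> List[SceneObject]:
    """Parse a model's JSON reply into scene objects.

    Raises KeyError or ValueError on malformed entries, so a reply that
    cannot be turned into a scene fails loudly before any rendering.
    """
    data = json.loads(raw)
    objects = []
    for entry in data["objects"]:
        x, y, z = (float(v) for v in entry["position"])
        objects.append(
            SceneObject(
                shape=str(entry["shape"]),
                color=str(entry["color"]),
                size=float(entry["size"]),
                position=(x, y, z),
            )
        )
    return objects


# A made-up reply, standing in for what a tool-using agent might emit.
reply = (
    '{"objects": [{"shape": "cube", "color": "red", '
    '"size": 0.7, "position": [1.0, -2.0, 0.35]}]}'
)
scene = parse_scene(reply)
```

In a full pipeline, the parsed objects would be handed to a rendering tool and the resulting image or scene compared against the input. That step is omitted here because the source does not specify the renderer interface.
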

Problem

Research questions and friction points this paper is trying to address.

Assessing VLMs' true scene understanding via active 3D recreation
Evaluating VLAs' tool-using capacity for inverse rendering tasks
Measuring geometric and spatial accuracy in generative scene reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic inverse rendering via tool use
Understanding-by-creating benchmark approach
Comprehensive metrics for 3D recreation
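
As a rough illustration of what metrics over geometry, spatial relations, and appearance could look like, the sketch below scores a predicted scene against a ground-truth scene. Everything in it is an assumption made for illustration: the dictionary layout, the greedy nearest-center matching, and the single left-of relation are simplified stand-ins, not the metric definitions used by IR3D-Bench.

```python
import math
from itertools import combinations


def greedy_match(pred, gt):
    """Pair each ground-truth object with the nearest unused predicted object."""
    unused = list(range(len(pred)))
    pairs = []
    for g_idx, g in enumerate(gt):
        if not unused:
            break
        best = min(
            unused,
            key=lambda p_idx: math.dist(pred[p_idx]["position"], g["position"]),
        )
        unused.remove(best)
        pairs.append((best, g_idx))
    return pairs


def mean_center_distance(pred, gt, pairs):
    """Geometric proxy: average distance between matched object centers."""
    dists = [math.dist(pred[p]["position"], gt[g]["position"]) for p, g in pairs]
    return sum(dists) / len(dists) if dists else float("inf")


def attribute_accuracy(pred, gt, pairs, key):
    """Appearance proxy: fraction of matched objects agreeing on one attribute."""
    hits = sum(pred[p][key] == gt[g][key] for p, g in pairs)
    return hits / len(pairs) if pairs else 0.0


def left_of(a, b):
    """Toy spatial relation: a is left of b when its x coordinate is smaller."""
    return a["position"][0] < b["position"][0]


def relation_accuracy(pred, gt, pairs):
    """Spatial proxy: fraction of object pairs whose left-of relation is preserved."""
    agree = total = 0
    for (p1, g1), (p2, g2) in combinations(pairs, 2):
        total += 1
        agree += left_of(pred[p1], pred[p2]) == left_of(gt[g1], gt[g2])
    return agree / total if total else 0.0


# Made-up example scenes (not benchmark data).
gt = [
    {"shape": "cube", "color": "red", "position": (0.0, 0.0, 0.0)},
    {"shape": "sphere", "color": "blue", "position": (2.0, 0.0, 0.0)},
]
pred = [
    {"shape": "sphere", "color": "green", "position": (2.0, 1.0, 0.0)},
    {"shape": "cube", "color": "red", "position": (0.0, 0.0, 0.0)},
]

pairs = greedy_match(pred, gt)
center_error = mean_center_distance(pred, gt, pairs)
color_acc = attribute_accuracy(pred, gt, pairs, "color")
shape_acc = attribute_accuracy(pred, gt, pairs, "shape")
rel_acc = relation_accuracy(pred, gt, pairs)
```

Reporting the scores separately rather than as one number keeps the evaluation decomposable: in this toy case the prediction gets shapes and layout right but misses one color and misplaces one object.
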