InfiniBench: Infinite Benchmarking for Visual Spatial Reasoning with Customizable Scene Complexity

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for evaluating the spatial reasoning of vision-language models (VLMs) fail to simultaneously ensure diversity, scalability, and fine-grained controllability. Method: We introduce the first fully automated, parameterized, and controllable system for generating 3D visual spatial reasoning evaluations. It integrates an LLM-driven agent framework, a cluster-based layout optimizer, and task-aware camera trajectory planning to synthesize high-fidelity, physically plausible videos of complex scenes. The system precisely converts natural language descriptions into embodied video inputs, supporting diverse spatial reasoning tasks, including measurement, viewpoint understanding, and spatiotemporal tracking. Contribution/Results: The system enables infinite generation of test samples with tunable complexity and significantly improves prompt fidelity and physical plausibility, outperforming prior methods in high-complexity scenarios. It establishes a customizable, reproducible paradigm for evaluating VLMs' spatial reasoning capabilities.

📝 Abstract
Modern vision-language models (VLMs) are expected to have spatial reasoning abilities across diverse scene complexities, but evaluating such abilities is difficult due to the lack of benchmarks that are not only diverse and scalable but also fully customizable. Existing benchmarks offer limited customizability over scene complexity and are incapable of isolating and analyzing specific VLM failure modes under distinct spatial conditions. To address this gap, instead of individually presenting benchmarks for different scene complexities, in this paper we present InfiniBench, a fully automated, customizable and user-friendly benchmark generator that can synthesize a theoretically infinite variety of 3D scenes with parameterized control over scene complexity. InfiniBench uniquely translates scene descriptions in natural language into photo-realistic videos with complex and physically plausible 3D layouts. This is achieved through three key innovations: 1) an LLM-based agentic framework that iteratively refines procedural scene constraints from scene descriptions; 2) a flexible cluster-based layout optimizer that generates dense and cluttered scenes previously intractable for procedural methods; and 3) a task-aware camera trajectory optimization method that renders scenes into videos with full object coverage as VLM input. Experiments demonstrate that InfiniBench outperforms state-of-the-art procedural and LLM-based 3D generation methods in prompt fidelity and physical plausibility, especially in high-complexity scenarios. We further showcase the usefulness of InfiniBench by generating benchmarks for representative spatial reasoning tasks, including measurement, perspective-taking and spatiotemporal tracking.
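The "full object coverage" objective behind the task-aware camera trajectory optimization can be viewed as a set-cover problem over candidate viewpoints. The sketch below is a toy greedy approximation under stated assumptions, not the paper's actual planner: view names and the precomputed per-view visibility sets are hypothetical, standing in for a renderer's visibility test.

```python
def plan_coverage_trajectory(candidate_views, num_objects):
    """Greedy set cover: repeatedly pick the candidate camera view that
    sees the most not-yet-covered objects, until every object is covered
    at least once (or no view adds coverage).

    candidate_views: dict view_id -> set of visible object ids,
    assumed precomputed by a visibility test."""
    uncovered = set(range(num_objects))
    trajectory = []
    while uncovered:
        # View with the largest number of newly visible objects.
        best = max(candidate_views,
                   key=lambda v: len(candidate_views[v] & uncovered))
        gain = candidate_views[best] & uncovered
        if not gain:  # remaining objects are invisible from every view
            break
        trajectory.append(best)
        uncovered -= gain
    return trajectory, uncovered

# Hypothetical visibility sets for 6 objects seen from 4 candidate views.
views = {
    "front":  {0, 1, 2},
    "side":   {2, 3},
    "top":    {3, 4, 5},
    "corner": {0, 5},
}
traj, missed = plan_coverage_trajectory(views, 6)
# Two views ("front", "top") already cover all 6 objects.
```

Greedy set cover is a standard approximation here; a real planner would additionally smooth the path between the selected viewpoints before rendering the video.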
Problem

Research questions and friction points this paper is trying to address.

Lack of customizable benchmarks for evaluating visual spatial reasoning abilities
Inability to isolate specific VLM failure modes under different spatial conditions
Limited scalability and diversity in existing spatial reasoning evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based agentic framework refines procedural scene constraints
Flexible cluster-based layout optimizer generates dense cluttered scenes
Task-aware camera trajectory optimization renders full coverage videos
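To make the second bullet concrete, here is a minimal 2D sketch of a cluster-based layout step: objects are scattered around cluster centers to create clutter, then pushed apart until collision-free. All names, the circle-shaped objects, and the pairwise-repulsion solver are illustrative assumptions, not the paper's actual optimizer.

```python
import math
import random

def resolve_overlaps(positions, radii, iterations=200, step=0.5):
    """Push overlapping circles apart until no pair intersects.
    A toy pairwise-repulsion solver standing in for the layout optimizer."""
    pos = [list(p) for p in positions]
    for _ in range(iterations):
        moved = False
        for i in range(len(pos)):
            for j in range(i + 1, len(pos)):
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy)
                min_dist = radii[i] + radii[j]
                if dist < min_dist:
                    moved = True
                    if dist == 0:  # coincident points: pick a direction
                        dx, dy, dist = 1e-3, 0.0, 1e-3
                    push = step * (min_dist - dist) / dist
                    pos[i][0] -= dx * push; pos[i][1] -= dy * push
                    pos[j][0] += dx * push; pos[j][1] += dy * push
        if not moved:  # layout is already collision-free
            break
    return pos

def cluster_layout(cluster_centers, objects_per_cluster,
                   radius=0.3, spread=0.5, seed=0):
    """Scatter objects tightly around each cluster center, then de-overlap.
    Dense clusters emulate the cluttered-scene regime described above."""
    rng = random.Random(seed)
    positions, radii = [], []
    for cx, cy in cluster_centers:
        for _ in range(objects_per_cluster):
            positions.append([cx + rng.uniform(-spread, spread),
                              cy + rng.uniform(-spread, spread)])
            radii.append(radius)
    return resolve_overlaps(positions, radii), radii

layout, radii = cluster_layout([(0, 0), (5, 5)], objects_per_cluster=4)
```

Clustering objects first and resolving collisions afterward is what lets dense, cluttered layouts be generated without hand-written placement rules; a full system would of course work in 3D and respect object meshes rather than bounding circles.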
Haoming Wang, University of Pittsburgh
Qiyao Xue, Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA
Wei Gao, Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA