E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes

📅 2026-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing visual search and embodied AI benchmarks struggle to evaluate fine-grained viewpoint-dependent phenomena that arise under five-degree-of-freedom (5-DoF) camera motion in realistic 3D environments, such as visibility changes caused by vertical viewpoint shifts or the revealing of container contents. To address this gap, this work introduces E3VS-Bench, a high-fidelity 3D visual search benchmark with full 5-DoF viewpoint control, comprising 99 scenes reconstructed with 3D Gaussian Splatting and 2,014 question-driven episodes that require active exploration across viewpoints to answer. The benchmark evaluates viewpoint-dependent perception, active evidence gathering, and coherent viewpoint planning within a single framework. Experiments show that state-of-the-art vision-language models fall substantially short of human performance on this task, highlighting their limitations in active perception and coherent viewpoint planning.

📝 Abstract

Visual search in 3D environments requires embodied agents to actively explore their surroundings and acquire task-relevant evidence. However, existing visual search and embodied AI benchmarks, including EQA, typically rely on static observations or constrained egocentric motion. As a result, they do not explicitly evaluate fine-grained viewpoint-dependent phenomena that arise under unrestricted 5-DoF viewpoint control in real-world 3D environments, such as visibility changes caused by vertical viewpoint shifts, revealing contents inside containers, and disambiguating object attributes that are only observable from specific angles. To address this limitation, we introduce E3VS-Bench, a benchmark for embodied 3D visual search in which agents must control their viewpoints in 5-DoF to gather viewpoint-dependent evidence for question answering. E3VS-Bench consists of 99 high-fidelity 3D scenes reconstructed using 3D Gaussian Splatting and 2,014 question-driven episodes. 3D Gaussian Splatting enables photorealistic free-viewpoint rendering that preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators. This allows the construction of questions that cannot be answered from a single view and instead require active inspection across viewpoints in 5-DoF. We evaluate multiple state-of-the-art VLMs and compare their performance with humans. Despite strong 2D reasoning ability, all models exhibit a substantial gap from humans, highlighting limitations in active perception and coherent viewpoint planning, specifically under full 5-DoF viewpoint changes.
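
To make the notion of 5-DoF viewpoint control more concrete, the following is a minimal sketch of one way such a camera pose could be represented. The abstract does not state which five degrees of freedom are used; the sketch assumes 3D position plus yaw and pitch (no roll), with a z-up world frame. The class name `Pose5DoF` and its methods are hypothetical and are not taken from the benchmark.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose5DoF:
    """Hypothetical 5-DoF camera pose: position plus yaw and pitch, no roll.

    Assumption: z is the vertical axis, yaw rotates about z, and positive
    pitch looks upward. The benchmark's actual convention may differ.
    """
    x: float
    y: float
    z: float
    yaw: float    # radians
    pitch: float  # radians, kept within [-pi/2, pi/2]

    def forward(self):
        """Unit viewing direction in world coordinates."""
        cp = math.cos(self.pitch)
        return (cp * math.cos(self.yaw),
                cp * math.sin(self.yaw),
                math.sin(self.pitch))

    def moved(self, dx=0.0, dy=0.0, dz=0.0, dyaw=0.0, dpitch=0.0):
        """Return a new pose after a world-frame translation and a rotation."""
        # Clamp pitch so the camera never flips over the vertical.
        pitch = max(-math.pi / 2, min(math.pi / 2, self.pitch + dpitch))
        # Wrap yaw into [0, 2*pi).
        yaw = (self.yaw + dyaw) % (2 * math.pi)
        return Pose5DoF(self.x + dx, self.y + dy, self.z + dz, yaw, pitch)


# Example: lower the camera and tilt it down, as one might to look into a container.
start = Pose5DoF(0.0, 0.0, 1.5, 0.0, 0.0)
inspect = start.moved(dz=-0.5, dpitch=math.radians(-45))
```

Under these assumptions, a vertical viewpoint shift of the kind the abstract mentions corresponds to changing `z` and `pitch` together, which a planar agent with a fixed camera height and yaw-only rotation cannot express.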

Problem

Research questions and friction points this paper is trying to address.

viewpoint-dependent perception

embodied visual search

3D Gaussian Splatting

5-DoF viewpoint control

active perception

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Gaussian Splatting

viewpoint-dependent perception

embodied visual search