Planning with the Views via Scene Self-Exploration

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models (VLMs) struggle with multi-step viewpoint planning in 3D scenes: they understand individual camera actions but fail to compose them across long-range viewpoint changes. This work proposes an iterative framework that alternates between self-exploration and view-graph distillation. Exploration trajectories are merged into a structured view graph, which is then converted into multi-task supervision signals; this mitigates the sparse-reward problem and helps the model capture structural relationships among viewpoints. The method raises the success rate of Qwen2.5-VL-7B on interactive view planning from 2.5% to 47.8%, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%).

📝 Abstract

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, which requires (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space.
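
To make the view-graph idea concrete, here is a minimal Python sketch of how trajectories could be merged into a graph and then turned into supervised examples. It is an illustration under assumptions, not the paper's implementation: the trajectory format (`(view, action, next_view)` triples), the action names, the view identifiers, and the two task types (`next_view` and `plan`) are all hypothetical, and real views would be rendered images rather than string ids.

```python
from collections import defaultdict, deque


def build_view_graph(trajectories):
    """Merge all trajectories, successful or not, into one directed graph."""
    graph = defaultdict(dict)  # view -> {action: next_view}
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph[view][action] = next_view
    return graph


def shortest_plan(graph, start, goal):
    """Breadth-first search for the shortest action sequence from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        view, plan = queue.popleft()
        if view == goal:
            return plan
        for action, next_view in graph.get(view, {}).items():
            if next_view not in seen:
                seen.add(next_view)
                queue.append((next_view, plan + [action]))
    return None  # goal is not reachable from start in the explored graph


def distill_tasks(graph):
    """Derive two illustrative kinds of supervised examples from the graph."""
    # Collect every view, including ones that only appear as edge targets.
    views = sorted(set(graph) | {n for edges in graph.values() for n in edges.values()})
    tasks = []
    for view in views:
        for action, next_view in graph.get(view, {}).items():
            # Single-step task: which view does this action lead to?
            tasks.append({"type": "next_view", "input": (view, action), "target": next_view})
    for start in views:
        for goal in views:
            plan = shortest_plan(graph, start, goal) if start != goal else None
            if plan and len(plan) > 1:
                # Multi-step task: which action sequence reaches the goal view?
                tasks.append({"type": "plan", "input": (start, goal), "target": plan})
    return tasks


# Hypothetical exploration data: three short trajectories in one scene.
demo_trajectories = [
    [("v0", "turn_left", "v1"), ("v1", "move_forward", "v2")],
    [("v0", "turn_right", "v3"), ("v3", "move_forward", "v2")],
    [("v2", "turn_left", "v4")],
]
demo_graph = build_view_graph(demo_trajectories)
demo_plan = shortest_plan(demo_graph, "v0", "v4")
demo_tasks = distill_tasks(demo_graph)
```

In this toy data no single trajectory goes from `v0` to `v4`, yet the merged graph yields a three-action plan between them. That is the sense in which trajectories contribute supervision regardless of whether they individually reached a target.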

Problem

Research questions and friction points this paper is trying to address.

view planning

vision-language models

3D scene understanding

multi-turn planning

camera movement

Innovation

Methods, ideas, or system contributions that make the work stand out.

view planning

vision-language models

self-exploration