RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming

📅 2026-01-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes a novel framework for text-to-3D scene generation that overcomes key limitations of existing methods, which are typically constrained by fixed camera trajectories, lack semantic layout understanding, struggle to adaptively explore occluded regions, and rely on conventional 2D inpainting models ill-suited to holes induced by camera motion. The proposed approach integrates semantic relationship reasoning with geometric constraints: first, it leverages a vision-language model to construct a semantic scene graph that guides object-aware, adaptive camera navigation; second, it introduces a motion-injected inpainting module that explicitly models camera motion to achieve temporally consistent hole filling. By jointly optimizing dynamic camera trajectories and semantic scene graphs, and by fine-tuning on a synthetic panoramic dataset, the method significantly enhances photorealism and geometric consistency, outperforming current state-of-the-art approaches.
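The scene-graph-guided roaming described above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy graph, object positions, BFS visiting order, and the `standoff` offset are invented stand-ins; in the paper a VLM infers object relations from the text prompt and the trajectory planner is far richer.

```python
from collections import deque

# Toy semantic scene graph: object -> semantically related neighbors.
# (Invented example; the paper derives such relations with a VLM.)
scene_graph = {
    "sofa":  ["table", "lamp"],
    "table": ["sofa", "plant"],
    "lamp":  ["sofa"],
    "plant": ["table"],
}
# Hypothetical 2D object centers on a floor plan.
positions = {"sofa": (0, 0), "table": (2, 0), "lamp": (-1, 1), "plant": (3, 1)}

def roaming_order(graph):
    """Visit objects by BFS from the most-connected one, so the camera
    roams between related objects instead of following a fixed orbit."""
    start = max(graph, key=lambda o: len(graph[o]))
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nb in graph[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return order

def waypoints(order, positions, standoff=1.0):
    """Place one camera waypoint offset from each object center."""
    return [(positions[o][0], positions[o][1] + standoff) for o in order]
```

Under these assumptions, `roaming_order(scene_graph)` yields a trajectory that starts at the hub object and expands outward along semantic edges, which is the spirit of "object-aware, adaptive camera navigation" even though the real planner also perceives object boundaries in 3D.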

📝 Abstract
Generating immersive 3D scenes from texts is a core task in computer vision, crucial for applications in virtual reality and game development. Despite the promise of leveraging 2D diffusion priors, existing methods suffer from spatial blindness and rely on predefined trajectories that fail to exploit the inner relationships among salient objects. Consequently, these approaches are unable to comprehend the semantic layout, preventing them from exploring the scene adaptively to infer occluded content. Moreover, current inpainting models operate in 2D image space, struggling to plausibly fill holes caused by camera motion. To address these limitations, we propose RoamScene3D, a novel framework that bridges the gap between semantic guidance and spatial generation. Our method reasons about the semantic relations among objects and produces consistent and photorealistic scenes. Specifically, we employ a vision-language model (VLM) to construct a scene graph that encodes object relations, guiding the camera to perceive salient object boundaries and plan an adaptive roaming trajectory. Furthermore, to mitigate the limitations of static 2D priors, we introduce a Motion-Injected Inpainting model that is fine-tuned on a synthetic panoramic dataset integrating authentic camera trajectories, making it adaptive to camera motion. Extensive experiments demonstrate that with semantic reasoning and geometric constraints, our method significantly outperforms state-of-the-art approaches in producing consistent and photorealistic scenes. Our code is available at https://github.com/JS-CHU/RoamScene3D.
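To make the "motion-injected" idea concrete, here is a hedged sketch of how camera motion might be injected as extra conditioning channels for an inpainting network. The channel layout, shapes, and the constant per-pixel motion field are assumptions for illustration only, not the authors' actual interface; the real model is a fine-tuned diffusion inpainter trained on a synthetic panoramic dataset with authentic trajectories.

```python
import numpy as np

def motion_conditioning(image, mask, cam_delta):
    """Stack the masked image, the hole mask, and a per-pixel motion
    field (dx, dy from the camera translation) into one conditioning map.

    image: (H, W, 3) float array, mask: (H, W, 1) with 1 marking holes,
    cam_delta: (dx, dy) camera translation between frames (illustrative).
    """
    h, w, _ = image.shape
    masked = image * (1.0 - mask)  # zero out holes revealed by motion
    dx = np.full((h, w, 1), cam_delta[0], dtype=np.float32)
    dy = np.full((h, w, 1), cam_delta[1], dtype=np.float32)
    # 6 channels total: 3 (masked RGB) + 1 (mask) + 2 (motion field).
    return np.concatenate([masked, mask, dx, dy], axis=-1)
```

The design intuition is that a static 2D inpainter sees only the masked frame, while injecting the motion field tells the network *why* the hole exists and in which direction content should be extrapolated, enabling temporally consistent fills across the roaming trajectory.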
Problem

Research questions and friction points this paper is trying to address.

text-to-3D
scene generation
spatial blindness
adaptive roaming
occlusion inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

adaptive roaming
scene graph
motion-injected inpainting
text-to-3D
semantic reasoning