WorldExplorer: Towards Generating Fully Navigable 3D Scenes

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of visual quality degradation—such as geometric stretching, noise amplification, and view inconsistency—under large-scale camera motion in text-to-3D scene generation. We propose the first navigable 3D generation framework built iteratively via autoregressive video trajectory modeling. Our method integrates multi-view-consistent panoramic initialization, video diffusion modeling, scene-memory-conditioned control, collision-aware trajectory generation, and 3D Gaussian splatting optimization. Crucially, we introduce an explicit scene memory module and a physics-based collision detection mechanism to ensure geometric coherence and multi-view consistency. Experiments demonstrate that our approach enables unrestricted free-camera navigation while maintaining high-fidelity, temporally stable, and physically plausible rendering even under drastic pose changes. To our knowledge, this is the first method achieving photorealistic, full-view-consistent, text-driven interactive 3D scene navigation.
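As a concrete illustration of the scene-memory idea described above (conditioning each new video on the most relevant prior views), the sketch below scores previously generated camera poses by proximity to a target pose and keeps the top-k. The scoring rule, function name, and pose conventions are assumptions for illustration only; the paper's actual selection criterion may differ.

```python
import numpy as np

def select_memory_views(prev_poses, target_pose, k=4, w_angle=1.0):
    """Pick the k prior camera poses most 'relevant' to a target pose.

    Illustrative heuristic only: relevance = negative position distance
    minus a weighted viewing-direction angle. Poses are 4x4 camera-to-world
    matrices; the third rotation column is taken as the viewing direction.
    """
    t_pos = target_pose[:3, 3]
    t_dir = target_pose[:3, 2] / np.linalg.norm(target_pose[:3, 2])
    scores = []
    for pose in prev_poses:
        pos, d = pose[:3, 3], pose[:3, 2]
        d = d / np.linalg.norm(d)
        angle = np.arccos(np.clip(np.dot(d, t_dir), -1.0, 1.0))
        scores.append(-(np.linalg.norm(pos - t_pos) + w_angle * angle))
    order = np.argsort(scores)[::-1]          # best (highest score) first
    return [prev_poses[i] for i in order[:k]]

# Toy usage: three prior views, one target view near the second one.
prev = [np.eye(4) for _ in range(3)]
prev[1][:3, 3] = [2.0, 0.0, 0.0]
prev[2][:3, 3] = [0.0, 5.0, 0.0]
target = np.eye(4); target[:3, 3] = [1.5, 0.2, 0.0]
print(len(select_memory_views(prev, target, k=2)))  # -> 2
```

Pose proximity is only one plausible relevance measure; image-overlap or feature-similarity scores would be equally valid stand-ins.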

📝 Abstract
Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited by the degree of exploration they allow inside a scene, i.e., they produce stretched-out and noisy artifacts when moving beyond central or panoramic perspectives. To this end, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360-degree panorama. Then, we expand them by leveraging video diffusion models in an iterative scene generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, like moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling for the first time realistic and unrestricted exploration. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.
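The collision-detection mechanism mentioned in the abstract (preventing trajectories from moving into objects) can be approximated with a depth-based clearance test. The function below, its name, and the thresholds are illustrative assumptions under the premise that a depth map is available at the current view; they are not the authors' implementation.

```python
import numpy as np

def is_step_collision_free(depth_map, step_forward, margin=0.3, center_crop=0.25):
    """Return True if moving `step_forward` meters along the viewing axis keeps
    at least `margin` meters of clearance to the nearest visible surface.

    Illustrative check only: it inspects the minimum depth inside a central
    crop of the depth map, where a forward-moving camera would collide first.
    """
    h, w = depth_map.shape
    ch, cw = int(h * center_crop), int(w * center_crop)
    center = depth_map[h//2 - ch//2 : h//2 + ch//2,
                       w//2 - cw//2 : w//2 + cw//2]
    nearest = float(np.min(center))
    return nearest - step_forward >= margin

# Toy usage: a surface 1.2 m ahead of the camera.
depth = np.full((480, 640), 5.0)
depth[200:280, 280:360] = 1.2
print(is_step_collision_free(depth, step_forward=1.0))   # False: too close
print(is_step_collision_free(depth, step_forward=0.5))   # True: enough clearance
```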
Problem

Research questions and friction points this paper is trying to address.

Generating fully navigable 3D scenes from text
Overcoming noisy artifacts in non-central viewpoints
Ensuring consistent visual quality across wide viewpoints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive video trajectory generation for 3D scenes (see the pipeline sketch after this list)
Scene memory conditions each video on the most relevant prior views
3D Gaussian Splatting fuses all generated views into a unified representation
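The sketch referenced above: a hypothetical driver loop showing how panorama initialization, collision checking, scene-memory selection, per-trajectory video generation, and 3D Gaussian Splatting fusion could fit together. All callables, data shapes, and names here are assumed placeholders, not the authors' API.

```python
from typing import Callable, List, Sequence

def explore_scene(
    init_panorama_views: Callable[[str], List[dict]],
    generate_video: Callable[[Sequence[dict], dict], List[dict]],
    select_memory: Callable[[List[dict], dict], List[dict]],
    collision_free: Callable[[List[dict], dict], bool],
    fit_gaussians: Callable[[List[dict]], object],
    prompt: str,
    trajectories: Sequence[dict],
):
    """Hypothetical driver for the described pipeline: panorama init, then
    autoregressive per-trajectory video generation conditioned on a scene
    memory, with a collision check, followed by a final 3DGS fusion."""
    views = init_panorama_views(prompt)           # multi-view consistent 360° start
    for traj in trajectories:                     # short, pre-defined camera paths
        if not collision_free(views, traj):       # skip paths that run into geometry
            continue
        memory = select_memory(views, traj)       # most relevant prior views
        views += generate_video(memory, traj)     # video diffusion along the path
    return fit_gaussians(views)                   # fuse all frames into one 3DGS scene
```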
👥 Authors
Manuel-Andreas Schneider, Technical University of Munich, Germany
Lukas Höllein, PhD Student at Technical University of Munich (Computer Vision, Machine/Deep Learning, Computer Graphics, Neural Rendering)
M. Nießner, Technical University of Munich, Germany