WorldMesh: Generating Navigable Multi-Room 3D Scenes via Mesh-Conditioned Image Diffusion

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of maintaining structural consistency in large-scale 3D scene generation from text, a limitation of current text-to-image and text-to-video methods that stems from their lack of explicit geometric modeling. The authors propose a geometry-first, two-stage framework: first generating an environment-level mesh scaffold (walls, floors, and other structural elements) from a text description, then populating it with objects via image synthesis, semantic segmentation, and object reconstruction; the scaffold is finally rendered to condition a diffusion model that synthesizes photorealistic appearance. By leveraging an explicit geometric backbone, the approach decouples structure from appearance, enabling high object diversity, strong global 3D consistency, and photorealistic detail across arbitrarily scaled multi-room scenes. This facilitates the construction of navigable, realistic 3D environments directly from language prompts.

📝 Abstract
Recent progress in image and video synthesis has inspired their use in advancing 3D scene generation. However, we observe that text-to-image and -video approaches struggle to maintain scene- and object-level consistency beyond a limited environment scale due to the absence of explicit geometry. We thus present a geometry-first approach that decouples this complex problem of large-scale 3D scene synthesis into its structural composition, represented as a mesh scaffold, and realistic appearance synthesis, which leverages powerful image synthesis models conditioned on the mesh scaffold. From an input text description, we first construct a mesh capturing the environment's geometry (walls, floors, etc.), and then use image synthesis, segmentation and object reconstruction to populate the mesh structure with objects in realistic layouts. This mesh scaffold is then rendered to condition image synthesis, providing a structural backbone for consistent appearance generation. This enables scalable, arbitrarily-sized 3D scenes of high object richness and diversity, combining robust 3D consistency with photorealistic detail. We believe this marks a significant step toward generating truly environment-scale, immersive 3D worlds.
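The two-stage pipeline in the abstract (text → mesh scaffold → scaffold-conditioned appearance synthesis) can be sketched as follows. This is a minimal illustrative stub, not the authors' implementation: every name, data structure, and placeholder heuristic here (`MeshScaffold`, `build_scaffold`, `populate_and_texture`, the room-counting shortcut) is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MeshScaffold:
    """Hypothetical container for environment-level geometry."""
    walls: list
    floors: list
    objects: list = field(default_factory=list)

def build_scaffold(prompt: str) -> MeshScaffold:
    # Stage 1 (stub): derive environment geometry from text.
    # A real system would run a text-conditioned layout generator;
    # here we just count "room" mentions as a placeholder.
    n_rooms = max(1, prompt.lower().count("room"))
    return MeshScaffold(
        walls=[f"wall_{i}" for i in range(4 * n_rooms)],
        floors=[f"floor_{i}" for i in range(n_rooms)],
    )

def populate_and_texture(scaffold: MeshScaffold, style_prompt: str) -> dict:
    # Stage 2 (stub): render the scaffold to conditioning maps, then
    # synthesize appearance, segment, and reconstruct objects to
    # populate the mesh. Real conditioning would be depth/normal
    # renders fed to an image diffusion model.
    conditioning = {"depth": len(scaffold.walls), "normals": len(scaffold.floors)}
    scaffold.objects = ["sofa", "table"]  # placeholder reconstructions
    return {"condition": conditioning, "scene": scaffold}

scene = populate_and_texture(build_scaffold("a two room apartment"), "cozy")
```

The key design point mirrored here is the decoupling: stage 1 fixes structure before any appearance synthesis happens, so stage 2 can be rerun with different style prompts against the same geometry.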
Problem

Research questions and friction points this paper is trying to address.

3D scene generation
scene consistency
geometry representation
large-scale environments
object-level consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

mesh-conditioned diffusion
3D scene generation
geometry-first approach
scalable 3D worlds
image synthesis
Manuel-Andreas Schneider
Technical University of Munich
Angela Dai
Technical University of Munich
Computer Graphics
Computer Vision