CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation

📅 2025-10-15

🤖 AI Summary

Outdoor 3D semantic scene generation is constrained by the absence of publicly available, well-annotated datasets, especially for generating 3D urban scenes from freehand sketches. To address this, we introduce SketchSem3D, a large-scale benchmark that pairs LiDAR voxels with their corresponding sketches and pseudo-labeled satellite-image annotations, and we propose CymbaDiff (Cylinder Mamba Diffusion), a diffusion model that imposes structured spatial ordering and explicitly captures cylindrical continuity and vertical hierarchy, preserving both physical neighborhood relationships and global context. Experiments on SketchSem3D show that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. Target applications include urban simulation and autonomous driving.

📝 Abstract

Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (each containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff), which significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at https://github.com/Lillian-research-hub/CymbaDiff.

Problem

Research questions and friction points this paper is trying to address.

Generating 3D outdoor semantic scenes from abstract freehand sketches

Overcoming the absence of publicly available, well-annotated datasets for urban scene generation

Enhancing spatial coherence and semantic consistency in generated scenes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Imposes structured spatial ordering for coherence

Captures cylindrical continuity and vertical hierarchy

Preserves neighborhood relationships and global context
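
To make "structured spatial ordering" with "cylindrical continuity and vertical hierarchy" more concrete, the following is a minimal illustrative sketch of one way a voxel grid could be serialized for a sequence model. It is not taken from the paper or its code release: the function name `cylindrical_order`, the choice of the grid center as the cylinder axis, and the (height layer, radius ring, azimuth) sort order are all assumptions made for illustration.

```python
import numpy as np


def cylindrical_order(shape):
    """Return a permutation of flat voxel indices for an (X, Y, Z) grid.

    Hypothetical ordering (an assumption, not the paper's method):
    voxels are sorted by height layer first, then by integer radius ring
    around the grid center, then by azimuth angle within each ring.
    """
    nx, ny, nz = shape
    x, y, z = np.meshgrid(
        np.arange(nx), np.arange(ny), np.arange(nz), indexing="ij"
    )

    # Cylinder axis assumed to pass through the center of the X-Y plane.
    cx, cy = (nx - 1) / 2.0, (ny - 1) / 2.0
    dx, dy = x - cx, y - cy

    radius = np.hypot(dx, dy)          # distance from the axis
    azimuth = np.arctan2(dy, dx)       # angle in [-pi, pi]
    ring = np.floor(radius).astype(np.int64)

    # np.lexsort treats the LAST key as the primary sort key.
    return np.lexsort((azimuth.ravel(), ring.ravel(), z.ravel()))


# Usage: flatten a semantic voxel grid into a 1D token sequence.
voxels = np.arange(4 * 4 * 2).reshape(4, 4, 2)
order = cylindrical_order(voxels.shape)
tokens = voxels.reshape(-1)[order]
```

Under this ordering, voxels in the same height layer and the same ring are adjacent in the sequence, so a scan-based sequence model sees neighbors along the azimuth consecutively. How CymbaDiff actually constructs its ordering should be checked against the paper and the released code.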

🔎 Similar Papers

LT3SD: Latent Trees for 3D Scene Diffusion