CymbaDiff: Structured Spatial Diffusion for Sketch-based 3D Semantic Urban Scene Generation

๐Ÿ“… 2025-10-15
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Existing outdoor 3D semantic scene generation is severely constrained by the scarcity of high-quality annotated data, especially for cross-modal generation from hand-drawn sketches to 3D urban scenes. To address this, we propose CymbaDiff, the first model to introduce a structured spatial diffusion mechanism that explicitly encodes cylindrical continuity and vertical stratification, ensuring geometric coherence and global contextual consistency. Leveraging the Mamba architecture, the method strengthens long-range dependency modeling and integrates sketch-guided LiDAR voxel generation with satellite-image-derived pseudo-label supervision. Evaluated on the newly constructed large-scale benchmark SketchSem3D, CymbaDiff achieves significant improvements in semantic consistency, spatial realism, and cross-dataset generalization, establishing a new paradigm for autonomous driving simulation and urban digital twin applications.

๐Ÿ“ Abstract
Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff) that significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at https://github.com/Lillian-research-hub/CymbaDiff
Problem

Research questions and friction points this paper is trying to address.

Generating 3D outdoor semantic scenes from abstract freehand sketches
Overcoming absence of public annotated datasets for urban scene generation
Enhancing spatial coherence and semantic consistency in generated scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Imposes structured spatial ordering for coherence
Captures cylindrical continuity and vertical hierarchy
Preserves neighborhood relationships and global context
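The paper does not spell out its serialization scheme here, but the idea of a structured spatial ordering with cylindrical continuity and vertical hierarchy can be illustrated with a minimal sketch: convert voxel coordinates to cylindrical coordinates and sort bottom-to-top, then around the azimuth, then outward in radius, so that consecutive tokens remain physical neighbors. The function name and the toy voxel grid below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cylinder_serialize(coords):
    """Illustrative cylindrical scan order (not the paper's exact scheme).

    coords: (N, 3) array of (x, y, z) voxel indices, with the scene
    center assumed at the x/y origin. Returns indices that order the
    voxels bottom-to-top (vertical stratification), then sweeping
    around the azimuth (cylindrical continuity), then outward in radius.
    """
    x = coords[:, 0].astype(float)
    y = coords[:, 1].astype(float)
    z = coords[:, 2]
    radius = np.hypot(x, y)
    azimuth = np.mod(np.arctan2(y, x), 2 * np.pi)
    # np.lexsort treats the LAST key as the primary sort key:
    # z first, then azimuth, then radius.
    return np.lexsort((radius, azimuth, z))

# Toy example: four voxels, two height layers
voxels = np.array([[1, 0, 1], [0, 1, 0], [-1, 0, 0], [1, 0, 0]])
order = cylinder_serialize(voxels)  # → [3, 1, 2, 0]
```

In this toy case the three ground-layer voxels come first (sorted counterclockwise from azimuth 0), followed by the single voxel on the upper layer, which is the kind of neighborhood-preserving sequence a Mamba-style state-space model can consume.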
๐Ÿ”Ž Similar Papers
No similar papers found.
Li Liang, The University of Western Australia (3D Point Cloud Processing, 3D Semantic Scene Completion, 3D Semantic Scene Generation)
Bo Miao, AIML, The University of Adelaide
Xinyu Wang, The University of Western Australia
Naveed Akhtar, The University of Melbourne
Jordan Vice, Ph.D. (artificial intelligence, machine learning, explainable AI, affective computing)
Ajmal Mian, The University of Western Australia