AI Summary
Existing outdoor 3D semantic scene generation is severely constrained by the scarcity of high-quality annotated data, especially for cross-modal generation from hand-drawn sketches to 3D urban scenes. To address this, the authors propose CymbaDiff, a diffusion model with a structured spatial ordering that explicitly encodes cylindrical continuity and vertical stratification, preserving local geometric coherence and global context. The method builds on the Mamba architecture for long-range dependency modeling and generates LiDAR voxel scenes conditioned on freehand sketches and pseudo-labeled satellite-image annotations. Evaluated on the newly constructed large-scale benchmark SketchSem3D, CymbaDiff is reported to achieve superior semantic consistency, spatial realism, and cross-dataset generalization. Target applications include autonomous driving simulation and urban digital twins.
Abstract
Outdoor 3D semantic scene generation produces realistic and semantically rich environments for applications such as urban simulation and autonomous driving. However, advances in this direction are constrained by the absence of publicly available, well-annotated datasets. We introduce SketchSem3D, the first large-scale benchmark for generating 3D outdoor semantic scenes from abstract freehand sketches and pseudo-labeled annotations of satellite images. SketchSem3D includes two subsets, Sketch-based SemanticKITTI and Sketch-based KITTI-360 (containing LiDAR voxels along with their corresponding sketches and annotated satellite images), to enable standardized, rigorous, and diverse evaluations. We also propose Cylinder Mamba Diffusion (CymbaDiff), which significantly enhances spatial coherence in outdoor 3D scene generation. CymbaDiff imposes structured spatial ordering, explicitly captures cylindrical continuity and vertical hierarchy, and preserves both physical neighborhood relationships and global context within the generated scenes. Extensive experiments on SketchSem3D demonstrate that CymbaDiff achieves superior semantic consistency, spatial realism, and cross-dataset generalization. The code and dataset will be available at https://github.com/Lillian-research-hub/CymbaDiff
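To make "structured spatial ordering" more concrete, the sketch below shows one possible way to serialize a voxel grid into a 1D token sequence that respects vertical layers and cylindrical rings, which is the kind of sequence a Mamba-style model consumes. This is an illustrative assumption only, not the authors' implementation: the function name `cylindrical_order`, the grid-center origin, and the integer ring binning are all hypothetical choices.

```python
import numpy as np

def cylindrical_order(shape):
    """Return a permutation of flat voxel indices sorted by
    (height layer, radial ring, azimuth). Hypothetical helper."""
    nx, ny, nz = shape
    x, y, z = np.meshgrid(
        np.arange(nx), np.arange(ny), np.arange(nz), indexing="ij"
    )
    # Assumption: the cylinder axis passes through the grid center.
    cx, cy = (nx - 1) / 2.0, (ny - 1) / 2.0
    dx, dy = x - cx, y - cy
    ring = np.floor(np.hypot(dx, dy)).astype(int)  # coarse radial bin
    azimuth = np.arctan2(dy, dx)                   # angle in [-pi, pi]
    # np.lexsort treats the LAST key as the primary one:
    # sort by z first, then by ring, then by azimuth.
    return np.lexsort((azimuth.ravel(), ring.ravel(), z.ravel()))

# Usage: serialize a semantic voxel grid and restore it afterwards.
shape = (8, 8, 4)
voxels = np.random.randint(0, 20, size=shape)  # dummy class labels
order = cylindrical_order(shape)
tokens = voxels.reshape(-1)[order]             # 1D sequence for a sequence model
inverse = np.argsort(order)                    # undoes the permutation
restored = tokens[inverse].reshape(shape)
```

In this ordering, voxels in the same height layer and ring are adjacent in the sequence, so angular neighbors stay close together. Any 1D ordering still has a seam where the azimuth wraps around, which is one reason a real model may need additional mechanisms to capture full cylindrical continuity.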