🤖 AI Summary
To address the scarcity of annotated real data and the domain shift of synthetic data in autonomous-driving 3D semantic segmentation, this paper proposes the first end-to-end 3D semantic scene generation framework. Unlike existing approaches that rely on image projection or on decoupled multi-resolution models, the method directly synthesizes full-scene point clouds with per-point semantic labels, enabling joint geometric-semantic modeling. Built on a pure 3D diffusion model, it avoids the projection artifacts and errors introduced by intermediate representations. Experiments show that downstream segmentation models trained jointly on synthetic and real data achieve significantly improved mIoU. The framework thus alleviates the annotation bottleneck and supports scalable training-set expansion. Its core contribution is an end-to-end, projection-free 3D scene generation framework that co-models geometry and semantics in a unified manner.
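To make the joint geometric-semantic diffusion idea concrete, the sketch below applies standard DDPM forward noising and a single reverse sampling step to a point cloud whose per-point feature vector concatenates xyz coordinates with a one-hot semantic label. This is a minimal illustration of diffusion over such a joint representation, not the paper's actual architecture; the noise schedule, the class count, and the use of the true noise in place of a learned denoiser are all assumptions made for the sketch.

```python
import numpy as np

def ddpm_schedule(T=100, beta_start=1e-4, beta_end=0.02):
    # Linear beta schedule (illustrative; not from the paper)
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def forward_noise(x0, t, alpha_bars, rng):
    # q(x_t | x_0): geometry and label channels are noised jointly
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def reverse_step(xt, t, eps_hat, betas, alphas, alpha_bars, rng):
    # One DDPM ancestral sampling step given a noise prediction eps_hat;
    # in a real model eps_hat would come from a trained 3D denoising network.
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(xt.shape)
    return mean

rng = np.random.default_rng(0)
N, C = 1024, 3 + 4            # xyz + one-hot over 4 hypothetical classes
x0 = rng.standard_normal((N, C))
betas, alphas, alpha_bars = ddpm_schedule()
xt, eps = forward_noise(x0, 50, alpha_bars, rng)
x_prev = reverse_step(xt, 50, eps, betas, alphas, alpha_bars, rng)
```

Because points and labels share one diffusion process, the denoiser can exploit correlations between scene geometry and semantics, which is the property that projection-based or decoupled pipelines give up.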
📝 Abstract
Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role in enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data remains a bottleneck for further development. To overcome this annotation limitation, synthetic simulated data has been used to generate annotated data on demand. However, a domain gap remains between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. These generative models have recently been applied to the 3D domain to generate scene-scale data with semantic annotations. Still, those methods rely either on image projection or on decoupled models trained at different resolutions in a coarse-to-fine manner. Such intermediate representations degrade the quality of the generated data due to the errors introduced by these transformations. In this work, we propose a novel approach that generates 3D semantic scene-scale data without relying on any projection or decoupled multi-resolution models, achieving more realistic semantic scene data generation than previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data for training a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data together with the real semantic segmentation labels leads to an improvement in the semantic segmentation model's performance. Our results show the potential of generated scene-scale point clouds to extend existing datasets with additional training data, reducing the data annotation effort. Our code is available at https://github.com/PRBonn/3DiSS.
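The abstract's training setup, mixing generated annotated scenes with real labeled scans, can be sketched as a batch sampler that draws a fixed fraction of each batch from the synthetic pool. The function name, the (point_cloud, labels) pair format, and the 50/50 ratio are illustrative assumptions, not details from the paper.

```python
import random

def mixed_training_batches(real, synthetic, batch_size=8, synth_ratio=0.5, seed=0):
    """Yield training batches mixing real and synthetic labeled scenes.

    `real` and `synthetic` are lists of (point_cloud, labels) pairs;
    `synth_ratio` controls the synthetic fraction per batch. All names
    and the default ratio are hypothetical, for illustration only.
    """
    rng = random.Random(seed)
    n_synth = int(batch_size * synth_ratio)
    n_real = batch_size - n_synth
    while True:
        # Sample without replacement within a batch, then shuffle so the
        # segmentation network sees real and synthetic scenes interleaved.
        batch = rng.sample(real, n_real) + rng.sample(synthetic, n_synth)
        rng.shuffle(batch)
        yield batch

# Usage with dummy placeholder samples (tags stand in for actual scenes):
real_pool = [("real", i) for i in range(20)]
synth_pool = [("synth", i) for i in range(20)]
batch = next(mixed_training_batches(real_pool, synth_pool))
```

In practice the same effect is often achieved by concatenating the two datasets and weighting the sampler; the fixed-ratio variant shown here simply makes the synthetic contribution per batch explicit.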