🤖 AI Summary
Existing autonomous driving data generation methods rely heavily on coarse scene layouts, making it difficult to jointly model and synthesize diverse, high-fidelity, multi-modal training data with precise annotations. This paper proposes the first hierarchical generative framework unified by semantic occupancy as an intermediate representation, simultaneously synthesizing three critical modalities: semantic occupancy grids, video sequences, and LiDAR point clouds. It introduces two novel transfer strategies, Gaussian-based Joint Rendering and Prior-guided Sparse Modeling, which integrate conditional diffusion, sparse geometric priors, and cross-domain generation. The method achieves state-of-the-art performance on all three generation tasks and significantly improves downstream perception and motion planning accuracy. By enabling controllable, scalable, and annotation-consistent simulation data synthesis, this work establishes a new paradigm for autonomous driving data generation.
📝 Abstract
Generating high-fidelity, controllable, and annotated training data is critical for autonomous driving. Existing methods typically generate a single data form directly from a coarse scene layout, which not only fails to output the rich data forms required for diverse downstream tasks but also struggles to model the direct layout-to-data distribution. In this paper, we introduce UniScene, the first unified framework for generating three key data forms - semantic occupancy, video, and LiDAR - in driving scenes. UniScene employs a progressive generation process that decomposes the complex task of scene generation into two hierarchical steps: (a) first generating semantic occupancy from a customized scene layout as a meta scene representation rich in both semantic and geometric information, and then (b) conditioned on occupancy, generating video and LiDAR data, respectively, using two novel transfer strategies: Gaussian-based Joint Rendering and Prior-guided Sparse Modeling. This occupancy-centric approach reduces the generation burden, especially for intricate scenes, while providing detailed intermediate representations for the subsequent generation stages. Extensive experiments demonstrate that UniScene outperforms previous SOTAs in occupancy, video, and LiDAR generation, and that these gains carry over to downstream driving tasks.
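The two-step hierarchy described above can be sketched as a data-flow skeleton. Everything below - the function names, grid sizes, and the random/projection placeholder logic - is a hypothetical illustration of the control flow under assumed interfaces, not the paper's actual models:

```python
import random

def generate_occupancy(layout, grid_shape=(4, 4, 2), num_classes=3, seed=42):
    """Stage (a): semantic occupancy from a scene layout.
    Placeholder for the layout-conditioned generator: we simply fill
    the grid with random class labels (0 = empty) for illustration."""
    rng = random.Random(seed)
    X, Y, Z = grid_shape
    return [[[rng.randrange(num_classes) for _ in range(Z)]
             for _ in range(Y)] for _ in range(X)]

def occupancy_to_video(occ, num_frames=2):
    """Stage (b1): stand-in for Gaussian-based Joint Rendering.
    Projects each (x, y) column to its maximum class label per frame."""
    frame = [[max(col) for col in row] for row in occ]
    return [frame for _ in range(num_frames)]

def occupancy_to_lidar(occ):
    """Stage (b2): stand-in for Prior-guided Sparse Modeling.
    Emits one labeled point per non-empty voxel."""
    return [(x, y, z, label)
            for x, plane in enumerate(occ)
            for y, col in enumerate(plane)
            for z, label in enumerate(col)
            if label != 0]

def uniscene_pipeline(layout):
    """Occupancy-centric hierarchy: layout -> occupancy -> (video, LiDAR)."""
    occ = generate_occupancy(layout)
    return occ, occupancy_to_video(occ), occupancy_to_lidar(occ)
```

Note how the video and LiDAR branches consume only the occupancy grid, never the raw layout: this is the occupancy-centric decomposition that splits one hard layout-to-data mapping into two easier conditional steps.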