Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method

📅 2025-10-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

High-quality annotated occupancy data remains scarce, which limits occupancy-centric generation of realistic driving scenes. To address this, we introduce Nuplan-Occ, the largest semantic occupancy dataset to date, and propose a unified generative framework that jointly synthesizes semantic occupancy, multi-view videos, and LiDAR point clouds. Methodologically, our approach uses a spatio-temporal disentangled architecture for spatial expansion and temporal forecasting of 4D dynamic occupancy, a Gaussian splatting-based sparse point map rendering strategy to improve multi-view video generation, and a sensor-aware embedding strategy that models LiDAR sensor properties for multi-LiDAR simulation. Extensive experiments show that our method achieves higher generation fidelity and better scalability than existing approaches, and that the generated data has practical value in downstream tasks.

📝 Abstract

Driving scene generation is a critical domain for autonomous driving, enabling downstream applications such as perception and planning evaluation. Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities; however, their performance depends heavily on annotated occupancy data, which remains scarce. To overcome this limitation, we curate Nuplan-Occ, the largest semantic occupancy dataset to date, constructed from the widely used Nuplan benchmark. Its scale and diversity facilitate not only large-scale generative modeling but also downstream autonomous driving applications. Based on this dataset, we develop a unified framework that jointly synthesizes high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. Our approach incorporates a spatio-temporal disentangled architecture to support high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy. To bridge modal gaps, we further propose two novel techniques: a Gaussian splatting-based sparse point map rendering strategy that enhances multi-view video generation, and a sensor-aware embedding strategy that explicitly models LiDAR sensor properties for realistic multi-LiDAR simulation. Extensive experiments demonstrate that our method achieves superior generation fidelity and scalability compared to existing approaches, and validate its practical value in downstream tasks. Repo: https://github.com/Arlo0o/UniScene-Unified-Occupancy-centric-Driving-Scene-Generation/tree/v2
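
To make the idea of a sparse point map conditioned on occupancy more concrete, here is a minimal sketch that projects occupied voxel centers into a single camera view and keeps the nearest voxel per pixel. It is not the paper's method: the abstract describes a Gaussian splatting-based renderer, whereas this sketch substitutes plain pinhole projection with a z-buffer. The names `PinholeCamera` and `render_sparse_point_map`, the dict-based grid, and the assumption that the grid is already expressed in the camera frame are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PinholeCamera:
    """Hypothetical pinhole intrinsics in pixels; not taken from the paper."""
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int


def render_sparse_point_map(occupancy, voxel_size, origin, cam):
    """Project occupied voxel centers into one camera view.

    occupancy: dict mapping an integer voxel index (i, j, k) to a semantic label.
    origin: camera-frame position of the grid corner (x right, y down, z forward).
    Returns a dict mapping pixel (u, v) to (depth, label), keeping the nearest
    voxel per pixel. Pixels that no voxel hits are absent, so the map is sparse.
    """
    point_map = {}
    for (i, j, k), label in occupancy.items():
        # Voxel center in the camera frame (assumed, simplified geometry).
        x = origin[0] + (i + 0.5) * voxel_size
        y = origin[1] + (j + 0.5) * voxel_size
        z = origin[2] + (k + 0.5) * voxel_size
        if z <= 0.0:
            continue  # behind the camera
        # Pinhole projection to integer pixel coordinates.
        u = math.floor(cam.fx * x / z + cam.cx)
        v = math.floor(cam.fy * y / z + cam.cy)
        if not (0 <= u < cam.width and 0 <= v < cam.height):
            continue  # outside the image
        # Z-buffer: keep only the closest voxel that lands on this pixel.
        previous = point_map.get((u, v))
        if previous is None or z < previous[0]:
            point_map[(u, v)] = (z, label)
    return point_map
```

In a pipeline of the kind the abstract describes, a map like this (depth and semantics at a sparse set of pixels) would serve as per-view conditioning for video generation; a splatting-based renderer would additionally give each voxel a spatial footprint instead of a single pixel.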

Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of annotated occupancy data for autonomous driving

Developing unified framework for multimodal driving scene generation

Bridging modal gaps between occupancy, video and LiDAR data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest semantic occupancy dataset Nuplan-Occ

Spatio-temporal disentangled architecture for 4D occupancy

Gaussian splatting rendering and sensor-aware embedding
