Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method

📅 2025-10-26
📈 Citations: 0
Influential: 0

🤖 AI Summary
High-quality annotated occupancy data remains scarce, which limits occupancy-centric generation of realistic autonomous driving scenes. To address this, we introduce Nuplan-Occ, the largest semantic occupancy dataset to date, and propose a unified multimodal generative framework that jointly synthesizes 4D semantic occupancy grids, multi-view videos, and LiDAR point clouds. Methodologically, our approach employs a spatio-temporal disentangled network to model dynamic scene evolution, uses Gaussian splatting-based sparse point map rendering to improve multi-view video generation, and incorporates sensor-aware embeddings that explicitly model LiDAR sensor properties. Extensive experiments show that our method achieves higher generation fidelity and better scalability than existing approaches, and that the synthesized data is useful in downstream tasks such as perception and planning.

📝 Abstract
Driving scene generation is a critical domain for autonomous driving, enabling downstream applications, including perception and planning evaluation. Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities; however, their performance depends heavily on annotated occupancy data, which remains scarce. To overcome this limitation, we curate Nuplan-Occ, the largest semantic occupancy dataset to date, constructed from the widely used Nuplan benchmark. Its scale and diversity facilitate not only large-scale generative modeling but also downstream autonomous driving applications. Based on this dataset, we develop a unified framework that jointly synthesizes high-quality semantic occupancy, multi-view videos, and LiDAR point clouds. Our approach incorporates a spatio-temporal disentangled architecture to support high-fidelity spatial expansion and temporal forecasting of 4D dynamic occupancy. To bridge modal gaps, we further propose two novel techniques: a Gaussian splatting-based sparse point map rendering strategy that enhances multi-view video generation, and a sensor-aware embedding strategy that explicitly models LiDAR sensor properties for realistic multi-LiDAR simulation. Extensive experiments demonstrate that our method achieves superior generation fidelity and scalability compared to existing approaches, and validate its practical value in downstream tasks. Repo: https://github.com/Arlo0o/UniScene-Unified-Occupancy-centric-Driving-Scene-Generation/tree/v2
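
For readers unfamiliar with the representation, the sketch below illustrates what a semantic occupancy grid is: a dense 3D array in which each voxel stores a class label. It is a minimal illustration and not the dataset's annotation pipeline. The function name `voxelize`, its parameters, and the convention that label 0 means free space are all assumptions made for this example.

```python
import numpy as np

def voxelize(points, labels, grid_min, voxel_size, grid_shape, free_label=0):
    """Build a dense semantic occupancy grid from labeled 3D points.

    Hypothetical helper for illustration only.
    points:     (N, 3) array of x, y, z coordinates in meters
    labels:     (N,) array of integer class ids
    grid_min:   (3,) coordinates of the grid's minimum corner
    voxel_size: edge length of one cubic voxel in meters
    grid_shape: (X, Y, Z) number of voxels along each axis
    """
    points = np.asarray(points, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    grid = np.full(grid_shape, free_label, dtype=np.int64)

    # Map each point to the integer index of the voxel that contains it.
    idx = np.floor((points - np.asarray(grid_min)) / voxel_size).astype(np.int64)

    # Discard points that fall outside the grid bounds.
    inside = np.all((idx >= 0) & (idx < np.asarray(grid_shape)), axis=1)
    idx, labels = idx[inside], labels[inside]

    # Simplification: if several points share a voxel, one of them sets the
    # label. Real pipelines typically use majority voting or similar.
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = labels
    return grid

grid = voxelize(
    points=[[0.1, 0.1, 0.1], [1.5, 0.2, 0.3], [10.0, 10.0, 10.0]],
    labels=[3, 7, 9],
    grid_min=(0.0, 0.0, 0.0),
    voxel_size=1.0,
    grid_shape=(2, 2, 2),
)
```

A 4D occupancy sequence, in this simplified view, would be a stack of such grids over time.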
Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of annotated occupancy data for autonomous driving
Developing a unified framework for multimodal driving scene generation
Bridging modal gaps between occupancy, video, and LiDAR data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Nuplan-Occ, the largest semantic occupancy dataset to date
Spatio-temporal disentangled architecture for 4D occupancy
Gaussian splatting-based sparse point map rendering and sensor-aware embedding
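
To give a rough sense of how an occupancy grid can condition a camera view, the sketch below projects occupied voxel centers into an image with a pinhole camera and keeps the nearest voxel per pixel. This is a deliberately simplified stand-in and not the paper's Gaussian splatting-based rendering: it draws single points rather than splatting Gaussians. The function name, the intrinsics `K`, the extrinsic `T_cam_from_world`, and the free-space label 0 are assumptions made for illustration.

```python
import numpy as np

def render_sparse_point_map(grid, grid_min, voxel_size, K, T_cam_from_world,
                            image_hw, free_label=0):
    """Project occupied voxel centers into a camera (hypothetical helper).

    Returns a depth map (inf where empty) and a semantic label map.
    """
    h, w = image_hw
    occ = np.argwhere(grid != free_label)              # (N, 3) voxel indices
    sem = grid[occ[:, 0], occ[:, 1], occ[:, 2]]        # (N,) class ids

    # Voxel centers in world coordinates, then in camera coordinates.
    centers = np.asarray(grid_min) + (occ + 0.5) * voxel_size
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)
    cam = (T_cam_from_world @ homo.T).T[:, :3]

    # Keep only points in front of the camera.
    front = cam[:, 2] > 1e-6
    cam, sem = cam[front], sem[front]

    # Pinhole projection to integer pixel coordinates.
    uvw = (K @ cam.T).T
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(np.int64)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(np.int64)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, sem = u[valid], v[valid], cam[valid, 2], sem[valid]

    # Z-buffer: sort by pixel, then by depth, and keep the first (nearest)
    # point for each pixel.
    flat = v * w + u
    order = np.lexsort((z, flat))
    flat_sorted = flat[order]
    first = np.ones(len(flat_sorted), dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1]
    keep = order[first]

    depth = np.full((h, w), np.inf)
    label = np.full((h, w), free_label, dtype=grid.dtype)
    depth[v[keep], u[keep]] = z[keep]
    label[v[keep], u[keep]] = sem[keep]
    return depth, label

# Two occupied voxels on the optical axis; the nearer one should win.
demo_grid = np.zeros((3, 3, 3), dtype=np.int64)
demo_grid[1, 1, 0] = 4   # center at z = 1.5 m
demo_grid[1, 1, 2] = 9   # center at z = 3.5 m
K = np.array([[10.0, 0.0, 5.0], [0.0, 10.0, 5.0], [0.0, 0.0, 1.0]])
depth, label = render_sparse_point_map(
    demo_grid, grid_min=(-1.5, -1.5, 1.0), voxel_size=1.0,
    K=K, T_cam_from_world=np.eye(4), image_hw=(10, 10),
)
```

The resulting maps are sparse: most pixels stay empty, which is why a generative model is still needed to produce dense, realistic frames from such conditioning.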
Authors

Bohan Li
Shanghai Jiao Tong University, Shanghai, China, and Eastern Institute of Technology, Ningbo, China
Xin Jin
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China
Hu Zhu
College of Telecommunications and Information Engineering, Nanjing University of Posts and
Computational Photography, 3D imaging, target detection, infrared imaging
Hongsi Liu
Eastern Institute of Technology, Ningbo, China
Ruikai Li
Li Auto, Beijing, China
Jiazhe Guo
Li Auto, Beijing, China
Kaiwen Cai
Li Auto, Beijing, China
Chao Ma
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
Yueming Jin
Assistant Professor, National University of Singapore
Medical Image Analysis, Surgical AI & Robotics, Multimodal Learning
Hao Zhao
Tsinghua University, Beijing, China
Xiaokang Yang
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
Wenjun Zeng
Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China