AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing occupancy-guided methods for driving scene generation rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained control from arbitrary bird’s-eye-view (BEV) layouts and restricts their use in scalable simulation. This work proposes AnyScene, a unified occupancy-centric framework. Its Spatial-Temporal Occupancy Diffusion Transformer jointly tokenizes BEV and occupancy features and generates semantic occupancy sequences autoregressively from user-defined or cross-dataset BEV layouts. A Geometry-Grounded View Expansion module then treats the generated occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view video without reference frames. Because occupancy serves as the shared scene representation, the method supports long-horizon generation and flexible camera configurations at inference time. The authors report state-of-the-art performance in both occupancy and video generation, strong generalization to unseen and customized layouts, and measurable benefits for downstream tasks such as sparse-view 3D reconstruction.

📝 Abstract

Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of rare safety-critical scenarios. Existing occupancy-guided methods typically rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained controllability from arbitrary BEV layouts and restricts their applicability for scalable simulation. In this paper, we propose AnyScene, a unified occupancy-centric framework for driving scene generation. AnyScene generates semantic occupancy sequences from BEV layouts through a Spatial-Temporal Occupancy Diffusion Transformer that jointly tokenizes BEV and occupancy features in an autoregressive manner. This design enables precise controllability from cross-dataset and user-defined BEV inputs while naturally supporting long-horizon generation. Building upon the generated occupancy, a Geometry-Grounded View Expansion module treats occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos in a reference-free and autoregressive fashion, supporting flexible camera configurations at inference time. Extensive experiments demonstrate that AnyScene achieves state-of-the-art performance in both occupancy and video generation. It exhibits strong generalization to unseen and customized layouts, and provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction.
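
The autoregressive, layout-conditioned generation described in the abstract can be pictured as a simple rollout loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `rollout`, `lift_bev`, and `context_len`, as well as the list-based grid types, are hypothetical, and `lift_bev` is only a placeholder for the learned diffusion Transformer, which the abstract does not specify in detail.

```python
from typing import Callable, List

Bev = List[List[int]]        # H x W grid of semantic class ids (hypothetical format)
Occ = List[List[List[int]]]  # Z x H x W grid of semantic class ids (hypothetical format)


def lift_bev(bev: Bev, context: List[Occ], height: int = 2) -> Occ:
    """Placeholder generator: copies the BEV layout into every height layer.

    A real model would denoise an occupancy volume conditioned on the BEV
    layout and on the previously generated frames in `context`; this stand-in
    ignores `context` entirely.
    """
    return [[row[:] for row in bev] for _ in range(height)]


def rollout(
    bev_layouts: List[Bev],
    generate_frame: Callable[[Bev, List[Occ]], Occ] = lift_bev,
    context_len: int = 2,
) -> List[Occ]:
    """Generate one occupancy frame per BEV layout, autoregressively.

    Each step sees the current layout plus at most `context_len` previously
    generated frames, so the sequence can be extended to any horizon.
    """
    frames: List[Occ] = []
    for bev in bev_layouts:
        # Guard context_len == 0: frames[-0:] would return the whole list.
        context = frames[-context_len:] if context_len > 0 else []
        frames.append(generate_frame(bev, context))
    return frames


layouts = [[[0, 1], [2, 0]], [[1, 1], [0, 2]], [[0, 0], [2, 2]]]
sequence = rollout(layouts)
print(len(sequence), "frames, each with", len(sequence[0]), "height layers")
```

Because each step depends only on the current layout and a bounded window of earlier frames, the loop's cost per frame stays constant as the horizon grows, which is one plausible reading of why an autoregressive design suits long-horizon generation.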

Problem

Research questions and friction points this paper is trying to address.

controllable scene generation

autonomous driving

BEV layout

occupancy representation

synthetic data

Innovation

Methods, ideas, or system contributions that make the work stand out.

occupancy-centric generation

BEV-to-video synthesis

autoregressive diffusion transformer