🤖 AI Summary
Existing occupancy-guided methods for generating high-fidelity, controllable driving scenes rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained control from arbitrary bird’s-eye-view (BEV) layouts and restricts scalable simulation. This work proposes AnyScene, a unified occupancy-centric framework whose Spatial-Temporal Occupancy Diffusion Transformer jointly tokenizes BEV and occupancy features in an autoregressive manner to generate semantic occupancy sequences from user-defined or cross-dataset BEV layouts. A Geometry-Grounded View Expansion module then treats the generated occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos without reference frames. The design supports long-horizon generation and flexible camera configurations at inference time. Experiments report state-of-the-art performance in both occupancy and video generation, strong generalization to unseen and customized layouts, and measurable benefits for downstream tasks such as sparse-view 3D reconstruction.
📝 Abstract
Generating high-fidelity and controllable synthetic data is critical for advancing end-to-end autonomous driving, particularly for addressing the long tail of rare safety-critical scenarios. Existing occupancy-guided methods typically rely on shallow conditioning mechanisms and reference-frame-dependent video synthesis, which limits fine-grained controllability from arbitrary BEV layouts and restricts their applicability for scalable simulation. In this paper, we propose AnyScene, a unified occupancy-centric framework for driving scene generation. AnyScene generates semantic occupancy sequences from BEV layouts through a Spatial-Temporal Occupancy Diffusion Transformer that jointly tokenizes BEV and occupancy features in an autoregressive manner. This design enables precise controllability from cross-dataset and user-defined BEV inputs while naturally supporting long-horizon generation. Building upon the generated occupancy, a Geometry-Grounded View Expansion module treats occupancy as the canonical spatial representation and synthesizes temporally consistent multi-view driving videos in a reference-free and autoregressive fashion, supporting flexible camera configurations at inference time. Extensive experiments demonstrate that AnyScene achieves state-of-the-art performance in both occupancy and video generation. It exhibits strong generalization to unseen and customized layouts, and provides measurable benefits for downstream tasks such as sparse-view 3D reconstruction.
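As a rough illustration of the autoregressive, layout-conditioned generation described above, the following minimal Python sketch rolls out an occupancy sequence one frame at a time, conditioning each frame on its BEV layout and on a short window of previously generated frames. Everything in it is an assumption made for illustration: `toy_denoiser`, `sample_frame`, `rollout`, the grid sizes, the step count, and the `context` window are hypothetical, and the denoiser is a hand-written placeholder rather than the Spatial-Temporal Occupancy Diffusion Transformer itself.

```python
import numpy as np


def toy_denoiser(noisy_occ, bev, history):
    """Hypothetical stand-in for a learned denoiser (NOT the paper's network).

    Moves the noisy sample halfway toward a target built from the BEV
    footprint (copied along the height axis) and the previous frame.
    """
    target = np.repeat(bev[..., None], noisy_occ.shape[-1], axis=-1)
    if history:
        target = 0.5 * target + 0.5 * history[-1]
    return noisy_occ + 0.5 * (target - noisy_occ)


def sample_frame(bev, history, depth, steps, rng):
    """Start from Gaussian noise and refine it iteratively for one frame."""
    occ = rng.standard_normal(bev.shape + (depth,))
    for _ in range(steps):
        occ = toy_denoiser(occ, bev, history)
    return occ


def rollout(bev_sequence, depth=4, steps=8, context=2, seed=0):
    """Autoregressive rollout: frame t sees BEV layout t and the last
    `context` generated frames, so the sequence length is not fixed."""
    rng = np.random.default_rng(seed)
    frames = []
    for bev in bev_sequence:
        history = frames[-context:]  # empty list for the first frame
        frames.append(sample_frame(bev, history, depth, steps, rng))
    return np.stack(frames)


if __name__ == "__main__":
    layouts = np.zeros((5, 8, 8))   # 5 frames of an 8x8 BEV layout
    layouts[:, 2:4, 2:4] = 1.0      # a made-up occupied footprint
    occupancy = rollout(layouts)
    print(occupancy.shape)          # (frames, H, W, depth)
```

Because each frame depends only on its own layout and a bounded history, the same loop runs unchanged for longer layout sequences, which is the property that makes an autoregressive formulation a natural fit for long-horizon generation.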