🤖 AI Summary
This work addresses the limited scalability and lack of frame-level interactivity in existing autonomous driving simulation environments, which hinder closed-loop training and evaluation. The authors propose FAR-Drive, a frame-level autoregressive video generation framework built on a multi-view diffusion transformer with fine-grained structured control, which synthesizes geometrically consistent multi-camera videos. To maintain long-horizon temporal and cross-view consistency and to mitigate degradation under iterative self-conditioning, they use a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training. System-level inference optimizations keep latency below one second on a single GPU. On the nuScenes dataset, the method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches.
📝 Abstract
Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches, while maintaining sub-second latency on a single GPU.
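The frame-level closed-loop interaction described in the abstract can be illustrated with a minimal sketch. It is not the FAR-Drive implementation: `closed_loop_rollout`, `generate_next`, `policy`, `context_len`, and the toy stand-ins are hypothetical names chosen for illustration, and a frame is reduced to a list of floats. The sketch only shows the control flow in which the agent acts on the latest generated frame and the generator conditions on its own previous outputs.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Frame = List[float]  # stand-in for one multi-camera frame (assumption)
Action = float       # stand-in for a structured ego action (assumption)


def closed_loop_rollout(
    generate_next: Callable[[Sequence[Frame], Action], Frame],
    policy: Callable[[Frame], Action],
    first_frame: Frame,
    steps: int,
    context_len: int = 4,
) -> List[Frame]:
    """Alternate between the driving policy and the frame generator."""
    # Sliding window of the most recent frames used as conditioning.
    context: Deque[Frame] = deque([first_frame], maxlen=context_len)
    frames = [first_frame]
    for _ in range(steps):
        # The agent reacts to the most recent frame.
        action = policy(context[-1])
        # The generator sees its own earlier outputs (self-conditioning).
        next_frame = generate_next(list(context), action)
        context.append(next_frame)
        frames.append(next_frame)
    return frames


def toy_generator(context: Sequence[Frame], action: Action) -> Frame:
    # Toy dynamics: shift every value of the last frame by the action.
    return [value + action for value in context[-1]]


def toy_policy(frame: Frame) -> Action:
    # Toy policy: always emit the same action.
    return 1.0


rollout = closed_loop_rollout(toy_generator, toy_policy, [0.0, 0.0], steps=3)
```

Because each generated frame is fed back as context, any error in one step is carried into later steps, which is the autoregressive degradation the training strategy is meant to mitigate. The per-step cost of `generate_next` is also what the sub-second latency requirement constrains.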