FAR-Drive: Frame-AutoRegressive Video Generation in Closed-Loop Autonomous Driving

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited scalability and lack of frame-level interactivity of existing autonomous driving simulation environments, which hinder effective closed-loop training and evaluation. The authors propose FAR-Drive, a frame-level autoregressive video generation framework built on a multi-view diffusion transformer with structured action conditioning, which synthesizes geometrically consistent multi-camera videos. To maintain long-horizon cross-view and temporal consistency and to mitigate degradation under iterative self-conditioning, they introduce adaptive reference horizon conditioning and a blend-forcing (hybrid teacher-forcing) autoregressive training strategy. System-level inference optimizations further enable sub-second, low-latency video generation on a single GPU. On the nuScenes benchmark, the method achieves state-of-the-art closed-loop simulation fidelity among existing closed-loop driving simulation approaches.
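The closed-loop interaction pattern described above can be sketched as a simple rollout loop: at each step the simulator generates the next multi-view frame conditioned on a window of recent frames plus the agent's action, and that frame is fed back both to the policy and to the simulator's own conditioning. This is a minimal illustration only; `WorldModel`, `DrivingPolicy`, and the reference-window size are hypothetical stand-ins, not the paper's actual API.

```python
# Hypothetical sketch of frame-level autoregressive closed-loop simulation.
# The real model is a multi-view diffusion transformer; here the generator
# is a placeholder that just records what it was conditioned on.
from collections import deque

class WorldModel:
    """Stand-in for the video generator (not the paper's implementation)."""
    def next_frame(self, reference_frames, action):
        # Real model: diffusion sampling over all camera views,
        # conditioned on reference frames and a structured action.
        return {"step": len(reference_frames), "action": action}

class DrivingPolicy:
    """Stand-in driving agent: maps the latest frame to the next ego action."""
    def act(self, frame):
        return ("steer", 0.0)

def closed_loop_rollout(world, policy, first_frame, horizon):
    """Each generated frame is appended to the conditioning window and
    fed back for the next step (iterative self-conditioning)."""
    refs = deque([first_frame], maxlen=4)  # assumed reference-window size
    frames = [first_frame]
    for _ in range(horizon):
        action = policy.act(frames[-1])          # agent reacts per frame
        frame = world.next_frame(list(refs), action)
        refs.append(frame)                        # self-conditioning
        frames.append(frame)
    return frames

rollout = closed_loop_rollout(
    WorldModel(), DrivingPolicy(), {"step": 0, "action": None}, horizon=5
)
print(len(rollout))  # 6 frames: the seed frame plus 5 generated steps
```

The key property being illustrated is the frame-level feedback loop: unlike open-loop clip generation, the policy's action at every single frame influences the next generated frame.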

📝 Abstract
Despite rapid progress in autonomous driving, reliable training and evaluation of driving systems remain fundamentally constrained by the lack of scalable and interactive simulation environments. Recent generative video models achieve remarkable visual fidelity, yet most operate in open-loop settings and fail to support fine-grained frame-level interaction between agent actions and environment evolution. Building a learning-based closed-loop simulator for autonomous driving poses three major challenges: maintaining long-horizon temporal and cross-view consistency, mitigating autoregressive degradation under iterative self-conditioning, and satisfying low-latency inference constraints. In this work, we propose FAR-Drive, a frame-level autoregressive video generation framework for autonomous driving. We introduce a multi-view diffusion transformer with fine-grained structured control, enabling geometrically consistent multi-camera generation. To address long-horizon consistency and iterative degradation, we design a two-stage training strategy consisting of adaptive reference horizon conditioning and blend-forcing autoregressive training, which progressively improves consistency and robustness under self-conditioning. To meet low-latency interaction requirements, we further integrate system-level efficiency optimizations for inference acceleration. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance among existing closed-loop autonomous driving simulation approaches, while maintaining sub-second latency on a single GPU.
Problem

Research questions and friction points this paper is trying to address.

closed-loop simulation
autoregressive video generation
temporal consistency
interactive autonomous driving
low-latency inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

frame-autoregressive
closed-loop simulation
multi-view diffusion transformer
structured control
low-latency inference