🤖 AI Summary
To address the limitations of evaluating autonomous driving policies with real-world driving data, particularly the scarcity of rare hazardous or non-expert behaviors, this paper proposes a controllable driving world model. It integrates expert trajectories with diverse, synthetically generated non-expert behaviors from simulation to enable high-fidelity, highly controllable future prediction in open scenarios. Methodologically, the authors introduce the first heterogeneous-data-driven training paradigm and design a Video2Reward module that provides an end-to-end differentiable mapping from generated video sequences to reward signals. The architecture combines a diffusion-based Transformer, multi-source conditional fusion, CARLA-based data augmentation, and a dedicated reward estimation network. Experiments demonstrate a 44% improvement in visual fidelity of generated video, over 50% gains in controllability for both expert and non-expert actions, a 2% boost in NAVSIM planning performance, and a 25% increase in policy selection accuracy.
📝 Abstract
How can we reliably simulate future driving scenarios under a wide range of ego driving behaviors? Recent driving world models, developed exclusively on real-world driving data composed mainly of safe expert trajectories, struggle to follow hazardous or non-expert behaviors, which are rare in such data. This limitation restricts their applicability to tasks such as policy evaluation. In this work, we address this challenge by enriching real-world human demonstrations with diverse non-expert data collected from a driving simulator (e.g., CARLA), and building a controllable world model trained on this heterogeneous corpus. Starting with a video generator featuring a diffusion transformer architecture, we devise several strategies to effectively integrate conditioning signals and improve prediction controllability and fidelity. The resulting model, ReSim, enables Reliable Simulation of diverse open-world driving scenarios under various actions, including hazardous non-expert ones. To close the gap between high-fidelity simulation and applications that require reward signals to judge different actions, we introduce a Video2Reward module that estimates a reward from ReSim's simulated future. Our ReSim paradigm achieves up to 44% higher visual fidelity, improves controllability for both expert and non-expert actions by over 50%, and boosts planning and policy selection performance on NAVSIM by 2% and 25%, respectively.
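To make the Video2Reward idea concrete, here is a minimal NumPy sketch of how a reward head could map a simulated clip to a scalar used for comparing candidate actions. This is an illustrative stand-in, not the paper's implementation: the pooling scheme, the linear head, and all dimensions are assumptions (the actual module is a learned reward estimation network trained end-to-end).

```python
import numpy as np

def pool_video(video):
    """Pool a (T, H, W, C) clip into a per-frame channel-mean trajectory (T, C).

    A stand-in for a learned video encoder: it reduces each frame to a small
    feature vector so a simple head can score the whole clip.
    """
    return video.mean(axis=(1, 2))

class Video2Reward:
    """Hypothetical reward head: score a simulated future with one scalar."""

    def __init__(self, n_channels=3, seed=0):
        rng = np.random.default_rng(seed)
        # Illustrative "learned" parameters; in practice these would be
        # trained so the reward ranks safe futures above hazardous ones.
        self.w = rng.standard_normal(n_channels) * 0.1
        self.b = 0.0

    def __call__(self, video):
        feats = pool_video(video)            # (T, C) per-frame features
        per_frame = feats @ self.w + self.b  # (T,) per-frame scores
        return float(per_frame.mean())       # scalar reward for the clip
```

In a policy-selection loop, each candidate action would be rolled out by the world model into a clip, scored by such a head, and the highest-reward action chosen; because every step is differentiable, gradients could also flow from the reward back through the generated frames.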