🤖 AI Summary
Current end-to-end autonomous driving (E2E AD) systems suffer from a scarcity of real-world, multi-view, safety-critical driving videos, which hinders robustness evaluation and improvement. To address this, we propose the first real-scenario-driven framework for synthesizing multi-view safety-critical driving videos. Our method comprises three core components: (1) vision-context-enhanced trajectory generation; (2) a two-stage controllable collision-avoidance mechanism; and (3) an integrated pipeline combining GRPO-finetuned vision-language models, diffusion-based multi-view video generation, and perception-guided safe trajectory planning. Experiments demonstrate that our synthesized videos significantly increase the measured collision rate of E2E planners under stress testing. The code, dataset, and sample videos are publicly released.
📝 Abstract
Safety-critical scenarios are rare yet pivotal for evaluating and enhancing the robustness of autonomous driving systems. While existing methods generate safety-critical driving trajectories, simulations, or single-view videos, they fall short of meeting the demands of advanced end-to-end autonomous driving (E2E AD) systems, which require real-world, multi-view video data. To bridge this gap, we introduce SafeMVDrive, the first framework designed to generate high-quality, safety-critical, multi-view driving videos grounded in real-world domains. SafeMVDrive strategically integrates a safety-critical trajectory generator with an advanced multi-view video generator. To tackle the challenges inherent in this integration, we first enhance the scene-understanding ability of the trajectory generator by incorporating visual context -- which was previously unavailable to such generators -- and leveraging a GRPO-finetuned vision-language model to achieve more realistic and context-aware trajectory generation. Second, recognizing that existing multi-view video generators struggle to render realistic collision events, we introduce a two-stage, controllable trajectory generation mechanism that produces collision-evasion trajectories, ensuring both video quality and safety-critical fidelity. Finally, we employ a diffusion-based multi-view video generator to synthesize high-quality safety-critical driving videos from the generated trajectories. Experiments conducted on an E2E AD planner demonstrate a significant increase in collision rate when it is tested with our generated data, validating the effectiveness of SafeMVDrive in stress-testing planning modules. Our code, examples, and datasets are publicly available at: https://zhoujiawei3.github.io/SafeMVDrive/.