SafeMVDrive: Multi-view Safety-Critical Driving Video Synthesis in the Real World Domain

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current end-to-end autonomous driving (E2E AD) systems suffer from a scarcity of real-world, multi-view, safety-critical driving videos, which hinders robustness evaluation and improvement. To address this, we propose the first real-scenario-driven framework for synthesizing multi-view safety-critical driving videos. Our method comprises three core components: (1) vision-context-enhanced trajectory generation driven by a GRPO-finetuned vision-language model; (2) a two-stage, controllable mechanism that produces collision-evasion trajectories; and (3) a diffusion-based multi-view video generator that renders the trajectories into high-quality videos. Experiments demonstrate that our synthesized videos significantly increase the collision rate of an E2E planner under stress testing. The code, dataset, and sample videos are publicly released.

📝 Abstract
Safety-critical scenarios are rare yet pivotal for evaluating and enhancing the robustness of autonomous driving systems. While existing methods generate safety-critical driving trajectories, simulations, or single-view videos, they fall short of meeting the demands of advanced end-to-end autonomous driving (E2E AD) systems, which require real-world, multi-view video data. To bridge this gap, we introduce SafeMVDrive, the first framework designed to generate high-quality, safety-critical, multi-view driving videos grounded in real-world domains. SafeMVDrive strategically integrates a safety-critical trajectory generator with an advanced multi-view video generator. To tackle the challenges inherent in this integration, we first enhance the scene-understanding ability of the trajectory generator by incorporating visual context -- which was previously unavailable to such generators -- and leveraging a GRPO-finetuned vision-language model to achieve more realistic and context-aware trajectory generation. Second, recognizing that existing multi-view video generators struggle to render realistic collision events, we introduce a two-stage, controllable trajectory generation mechanism that produces collision-evasion trajectories, ensuring both video quality and safety-critical fidelity. Finally, we employ a diffusion-based multi-view video generator to synthesize high-quality safety-critical driving videos from the generated trajectories. Experiments conducted on an E2E AD planner demonstrate a significant increase in collision rate when tested with our generated data, validating the effectiveness of SafeMVDrive in stress-testing planning modules. Our code, examples, and datasets are publicly available at: https://zhoujiawei3.github.io/SafeMVDrive/.
Problem

Research questions and friction points this paper is trying to address.

Real-world, multi-view, safety-critical driving videos are scarce; existing methods produce only trajectories, simulations, or single-view videos
Trajectory generators lack visual context, limiting scene understanding and the realism of generated trajectories
Existing multi-view video generators struggle to render realistic collision events
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates a safety-critical trajectory generator with a multi-view video generator
Uses a GRPO-finetuned vision-language model for context-aware trajectory generation
Combines a two-stage collision-evasion mechanism with diffusion-based multi-view video synthesis
Jiawei Zhou
Harbin Institute of Technology, Shenzhen
Linye Lyu
Harbin Institute of Technology, Shenzhen
Zhuotao Tian
Professor, Harbin Institute of Technology (Shenzhen)
Vision-language Models, Multi-modal Perception, Computer Vision
Cheng Zhuo
Zhejiang University
EDA algorithms, VLSI design
Yu Li
Zhejiang University