🤖 AI Summary
This study addresses the challenge of enhancing the causal understanding and safety testing of autonomous driving systems in high-risk scenarios by integrating real-world traffic accident causality into egocentric synthetic videos. To this end, we propose Causal-VidSyn, the first causal-aware video diffusion model that jointly incorporates accident cause descriptions and driver gaze cues, and introduce Drive-Gaze, a large-scale driving gaze dataset. Causal-VidSyn features three synergistic components for fine-grained causal control: a causal entity localization module, a gaze-conditioned selection module, and an accident cause question-answering module. Experiments demonstrate significant improvements over state-of-the-art methods in both video fidelity and causal sensitivity across three key tasks: accident video editing, normal-to-accident video generation, and text-to-video synthesis. Collectively, this work establishes a novel paradigm for causality-driven robustness evaluation of autonomous driving systems.
📝 Abstract
Egocentrically comprehending the causes and effects of car accidents is crucial for the safety of self-driving cars, and synthesizing accident videos that reflect causal entities can help test the capability to respond to accidents that are too costly or dangerous to reproduce in reality. However, incorporating causal relations, as observed in real-world videos, into synthetic videos remains challenging. This work argues that precisely identifying the accident participants and capturing their related behaviors are of critical importance. Accordingly, we propose a novel diffusion model, Causal-VidSyn, for synthesizing egocentric traffic accident videos. To enable causal entity grounding in video diffusion, Causal-VidSyn leverages cause descriptions and driver fixations to identify accident participants and their behaviors, facilitated by accident reason answering and gaze-conditioned selection modules. To support Causal-VidSyn, we further construct Drive-Gaze, the largest driver gaze dataset (1.54M frames of fixations) in driving accident scenarios. Extensive experiments show that Causal-VidSyn surpasses state-of-the-art video diffusion models in frame quality and causal sensitivity across various tasks, including accident video editing, normal-to-accident video diffusion, and text-to-video generation.
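To make the gaze-conditioned selection idea concrete, the sketch below shows one plausible way such a module could sit inside a video diffusion backbone: tokens with high driver-gaze saliency are selected and refined against the encoded accident-cause description, while the remaining tokens pass through unchanged. This is a minimal illustrative sketch, not the paper's implementation; the class name `GazeConditionedSelection`, the tensor shapes, the top-k ratio, and the cross-attention refinement are all assumptions made for this example.

```python
# Illustrative sketch (not the authors' code): a gaze-conditioned token
# selection block for a video diffusion backbone. Tokens whose spatial
# locations attract high driver-gaze saliency are selected and refined
# with the accident-cause text embedding; shapes, names, and the top-k
# strategy are assumptions made for this example.
import torch
import torch.nn as nn


class GazeConditionedSelection(nn.Module):
    def __init__(self, dim: int, select_ratio: float = 0.25, heads: int = 8):
        super().__init__()
        self.select_ratio = select_ratio
        # Cross-attention lets the gaze-selected video tokens attend to
        # the accident-cause description tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens, gaze_scores, cause_emb):
        """
        tokens:      [B, N, D] video latent tokens
        gaze_scores: [B, N]    per-token gaze saliency in [0, 1]
        cause_emb:   [B, T, D] encoded accident-cause description
        """
        B, N, D = tokens.shape
        k = max(1, int(N * self.select_ratio))

        # Pick the k tokens the driver is most likely fixating on.
        topk = gaze_scores.topk(k, dim=1).indices            # [B, k]
        idx = topk.unsqueeze(-1).expand(-1, -1, D)           # [B, k, D]
        selected = tokens.gather(1, idx)                     # [B, k, D]

        # Refine only the gaze-selected tokens with the cause text,
        # so causal entities receive targeted conditioning.
        refined, _ = self.cross_attn(self.norm(selected), cause_emb, cause_emb)
        refined = selected + refined

        # Scatter refined tokens back; unselected tokens pass through.
        return tokens.scatter(1, idx, refined)


if __name__ == "__main__":
    block = GazeConditionedSelection(dim=64)
    out = block(torch.randn(2, 196, 64),   # video tokens
                torch.rand(2, 196),        # gaze saliency per token
                torch.randn(2, 16, 64))    # cause-description embedding
    print(out.shape)  # torch.Size([2, 196, 64])
```

The design choice sketched here, conditioning only the gaze-selected tokens on the cause text rather than the whole frame uniformly, mirrors the abstract's emphasis on grounding the specific accident participants and their behaviors.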