🤖 AI Summary
This paper addresses early detection of and intervention in road rage by introducing a new task, causal reasoning for road rage, which aims to identify triggering events and infer their underlying causes from multimodal inputs *before* driver anger escalates, thereby enabling proactive, dialogue-based emotional soothing. Methodologically, the paper establishes a fine-grained vision–language reasoning benchmark tailored to driving scenarios, comprising a finely annotated test dataset and dedicated evaluation metrics, and uses it to assess mainstream vision-language models (VLMs) on scene understanding, event recognition, and road rage reasoning. Experiments expose significant deficiencies in current VLMs: scene understanding in the visual modality, and comprehension of spatial relationships between objects in the textual modality. Key contributions include: (1) formalizing the road rage reasoning task; (2) providing a domain-specific multimodal benchmark for evaluation; and (3) offering empirical insights to advance antecedent-focused, emotion-aware intervention systems.
📝 Abstract
Road rage, triggered by driving-related stimuli such as traffic congestion and aggressive driving, poses a significant threat to road safety. Previous research on road rage regulation has primarily focused on response suppression and therefore lacks proactive prevention capabilities. With the advent of Vision-Language Models (VLMs), it has become possible to reason about trigger events visually and then engage in dialogue-based comforting before drivers' anger escalates. To this end, we propose the road rage reasoning task, along with a finely annotated test dataset and evaluation metrics, to assess the capabilities of current mainstream VLMs in scene understanding, event recognition, and road rage reasoning. The results indicate that current VLMs exhibit significant shortcomings in scene understanding within the visual modality, as well as in comprehending the spatial relationships between objects in the textual modality. Improving VLMs' performance in these areas will greatly benefit downstream tasks such as antecedent-focused road rage regulation.