CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of infrastructure-centric evaluation for vision-language models in traffic collision scenarios, which hinders their ability to support roadside perception and reasoning required for cooperative autonomous driving. The authors introduce the first large-scale video–language benchmark derived from real-world roadside camera footage, encompassing 250 collision events and 13K structured question-answer pairs. They propose a two-tiered evaluation framework: the lower tier assesses visual grounding of scene elements, while the upper tier evaluates high-level reasoning capabilities regarding collision mechanisms, causality, temporal dynamics, and consequences. Systematic evaluation of eight state-of-the-art models reveals significant deficiencies in temporal and causal reasoning. The publicly released dataset and standardized evaluation framework fill a critical gap in roadside intelligence assessment and advance infrastructure-assisted perception technologies.
📝 Abstract
Cooperative autonomous driving requires traffic scene understanding from both vehicle and infrastructure perspectives. While vision-language models (VLMs) show strong general reasoning capabilities, their performance in safety-critical traffic scenarios remains insufficiently evaluated due to the ego-vehicle focus of existing benchmarks. To bridge this gap, we present **CrashSight**, a large-scale vision-language benchmark for roadway crash understanding using real-world roadside camera data. The dataset comprises 250 crash videos, annotated with 13K multiple-choice question-answer pairs organized under a two-tier taxonomy. Tier 1 evaluates the visual grounding of scene context and involved parties, while Tier 2 probes higher-level reasoning, including crash mechanics, causal attribution, temporal progression, and post-crash outcomes. We benchmark eight state-of-the-art VLMs and show that, despite strong scene description capabilities, current models struggle with temporal and causal reasoning in safety-critical scenarios. We provide a detailed analysis of failure scenarios and discuss directions for improving VLM crash understanding. The benchmark provides a standardized evaluation framework for infrastructure-assisted perception in cooperative autonomous driving. The CrashSight benchmark, including the full dataset and code, is accessible at https://mcgrche.github.io/crashsight.
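The two-tier protocol described above scores models on multiple-choice QA pairs, with accuracy reported separately for Tier 1 (grounding) and Tier 2 (reasoning). A minimal sketch of such per-tier scoring, assuming a hypothetical record format with `tier`, `answer`, and `prediction` fields (not CrashSight's actual schema):

```python
from collections import defaultdict

def tier_accuracy(results):
    """Compute per-tier multiple-choice accuracy.

    results: iterable of dicts with keys 'tier', 'answer', 'prediction'
             (field names are illustrative, not the benchmark's schema).
    Returns {tier: fraction of correct predictions}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["tier"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["tier"]] += 1
    return {t: correct[t] / total[t] for t in sorted(total)}

# Example: a model that grounds scenes well (Tier 1) but fails a
# causal-reasoning question (Tier 2), mirroring the reported gap.
sample = [
    {"tier": 1, "answer": "B", "prediction": "B"},
    {"tier": 1, "answer": "A", "prediction": "A"},
    {"tier": 2, "answer": "C", "prediction": "D"},
    {"tier": 2, "answer": "B", "prediction": "B"},
]
print(tier_accuracy(sample))  # {1: 1.0, 2: 0.5}
```

Reporting the two tiers separately, rather than a single pooled accuracy, is what exposes the paper's central finding: strong scene description alongside weak temporal and causal reasoning.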
Problem

Research questions and friction points this paper is trying to address.

traffic crash understanding
vision-language models
infrastructure-centric perception
temporal reasoning
causal attribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

infrastructure-centric
vision-language models
traffic crash reasoning
temporal and causal understanding
cooperative autonomous driving