AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal models lack systematic evaluation of spatiotemporal and intent reasoning in safety-critical dynamic scenarios such as terrestrial, aerial, and maritime traffic accidents, primarily due to the absence of a physics-grounded, cross-domain unified benchmark. To address this gap, we introduce AccidentBench: the first large-scale multimodal benchmark covering all three transportation domains, comprising approximately 2,000 real-world accident videos and over 19,000 fine-grained question-answer pairs. We propose a cross-domain, multi-level evaluation framework, structured orthogonally along temporal duration, reasoning difficulty, and task type, that systematically assesses temporal, spatial, and intent reasoning. Extensive experiments reveal that state-of-the-art models, including Gemini-2.5 Pro and GPT-5, achieve only about 18% accuracy on the most challenging tasks, exposing fundamental limitations in long-video understanding and complex dynamic reasoning. AccidentBench thus establishes a rigorous, domain-general standard for evaluating multimodal reasoning under physical realism and temporal complexity.

📝 Abstract
Rapid advances in multimodal models demand benchmarks that rigorously evaluate understanding and reasoning in safety-critical, dynamic real-world settings. We present AccidentBench, a large-scale benchmark that combines vehicle accident scenarios with "Beyond" domains: safety-critical settings in air and water that emphasize spatial and temporal reasoning (e.g., navigation, orientation, multi-vehicle motion). The benchmark contains approximately 2,000 videos and over 19,000 human-annotated question-answer pairs spanning multiple video lengths (short/medium/long) and difficulty levels (easy/medium/hard). Tasks systematically probe core capabilities: temporal, spatial, and intent understanding and reasoning. By unifying accident-centric traffic scenes with broader safety-critical scenarios in air and water, AccidentBench offers a comprehensive, physically grounded testbed for evaluating models under real-world variability. Evaluations of state-of-the-art models (e.g., Gemini-2.5 Pro and GPT-5) show that even the strongest models achieve only about 18% accuracy on the hardest tasks and longest videos, revealing substantial gaps in real-world temporal, spatial, and intent reasoning. AccidentBench is designed to expose these critical gaps and drive the development of multimodal models that are safer, more robust, and better aligned with real-world safety-critical challenges. The code and dataset are available at: https://github.com/SafeRL-Lab/AccidentBench
Problem

Research questions and friction points this paper is trying to address.

Evaluating multimodal models in safety-critical real-world scenarios
Assessing spatial and temporal reasoning in dynamic environments
Testing intent understanding across diverse accident and navigation contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines vehicle accident scenarios with air and water domains
Comprises approximately 2,000 videos and over 19,000 human-annotated question-answer pairs
Systematically probes temporal, spatial, and intent reasoning capabilities
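The benchmark's three evaluation axes (video length, difficulty level, and reasoning task) can be pictured as a simple record type with a slice filter. This is an illustrative sketch only; the field names, values, and `filter_split` helper are assumptions for exposition, not the schema actually released in the AccidentBench repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """One hypothetical question-answer item in an AccidentBench-style benchmark."""
    video_id: str
    domain: str      # "ground", "air", or "water"
    length: str      # "short", "medium", or "long"
    difficulty: str  # "easy", "medium", or "hard"
    task: str        # "temporal", "spatial", or "intent"
    question: str
    answer: str


def filter_split(pairs, length=None, difficulty=None, task=None):
    """Select an evaluation slice along the three orthogonal axes;
    axes left as None are not constrained."""
    return [
        p for p in pairs
        if (length is None or p.length == length)
        and (difficulty is None or p.difficulty == difficulty)
        and (task is None or p.task == task)
    ]


# Two toy records standing in for the ~2,000-video, 19,000-pair dataset.
pairs = [
    QAPair("v1", "ground", "long", "hard", "intent",
           "What does the white car intend to do?", "Change lanes"),
    QAPair("v2", "air", "short", "easy", "spatial",
           "On which side does the drone pass the tower?", "Left"),
]

# The slice on which the paper reports ~18% accuracy for the strongest models:
# the longest videos at the hardest difficulty.
hardest = filter_split(pairs, length="long", difficulty="hard")
```

Treating length, difficulty, and task as independent filters is what makes the framework "orthogonal": any combination of the three defines a valid evaluation subset.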