Enhancing Vision-Language Models with Scene Graphs for Traffic Accident Understanding

📅 2024-07-08
🏛️ 2024 IEEE International Automated Vehicle Validation Conference (IAVVC)
📈 Citations: 1
Influential: 0
🤖 AI Summary
Fine-grained classification of traffic accidents remains challenging due to complex spatial configurations and semantic ambiguities. Method: This paper proposes the first structured understanding framework integrating scene graph modeling with multimodal representation learning. It formalizes traffic scenes as object–relation graphs—where nodes represent traffic entities and edges encode geometric or semantic spatial relationships—and jointly encodes visual (ViT), linguistic (CLIP), and graph-structured modalities via cross-stage alignment to enable geometry-aware, semantically grounded reasoning. Contribution/Results: It is the first work to introduce structured scene graphs into traffic anomaly understanding, and it designs a novel graph–vision–language feature fusion mechanism with balanced accuracy optimization. Evaluated on the DoTA subset, the method achieves 57.77% balanced accuracy—outperforming the strongest baseline by 4.8 percentage points—demonstrating the efficacy of incorporating structural priors and synergistic multimodal modeling for fine-grained traffic accident analysis.

📝 Abstract
Recognizing a traffic accident is an essential part of any autonomous driving or road monitoring system. An accident can appear in a wide variety of forms, and understanding what type of accident is taking place may be useful to prevent it from reoccurring. This work focuses on classifying traffic scenes into specific accident types. We approach the problem by representing a traffic scene as a graph, where objects such as cars can be represented as nodes, and relative distances and directions between them as edges. This representation of a traffic scene is referred to as a scene graph, and can be used as input for an accident classifier. Better results are obtained with a classifier that fuses the scene graph input with visual and textual representations. This work introduces a multi-stage, multimodal pipeline that pre-processes videos of traffic accidents, encodes them as scene graphs, and aligns this representation with vision and language modalities before executing the classification task. When trained on 4 classes, our method achieves a balanced accuracy score of 57.77% on an (unbalanced) subset of the popular Detection of Traffic Anomaly (DoTA) benchmark, representing an increase of close to 5 percentage points from the case where scene graph information is not taken into account.
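The scene-graph representation described above can be illustrated with a minimal sketch. This is not the paper's implementation: the class, the coarse four-way direction buckets, and the example coordinates are all illustrative assumptions; the actual pipeline derives entities and relations from detected objects in video frames.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Toy traffic scene graph: nodes are entities, edges are spatial relations."""
    # name -> (x, y) position, e.g. from an object detector (assumed input)
    nodes: dict = field(default_factory=dict)
    # (src, dst) -> {"distance": ..., "direction": ...} relation attributes
    edges: dict = field(default_factory=dict)

    def add_entity(self, name: str, x: float, y: float) -> None:
        self.nodes[name] = (x, y)

    def connect_all(self) -> None:
        # Encode pairwise relative distance plus a coarse direction bucket,
        # mirroring the "relative distances and directions as edges" idea.
        names = list(self.nodes)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                ax, ay = self.nodes[a]
                bx, by = self.nodes[b]
                dist = math.hypot(bx - ax, by - ay)
                angle = math.degrees(math.atan2(by - ay, bx - ax)) % 360
                # Bucket the angle into 4 illustrative direction labels
                direction = ["right", "front", "left", "behind"][
                    int((angle + 45) % 360 // 90)
                ]
                self.edges[(a, b)] = {"distance": round(dist, 2),
                                      "direction": direction}

g = SceneGraph()
g.add_entity("ego_vehicle", 0.0, 0.0)
g.add_entity("car_1", 3.0, 4.0)
g.add_entity("pedestrian_1", -2.0, 0.0)
g.connect_all()
```

In the paper's pipeline, a graph like this would be encoded by a graph neural network and fused with ViT visual features and CLIP text features before classification; the fusion stage itself is not sketched here.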
Problem

Research questions and friction points this paper is trying to address.

Traffic Accident Analysis
Autonomous Driving
Road Surveillance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scene Graph
Multimodal Information Fusion
Traffic Accident Detection
Aaron Lohner
Carnegie Mellon University, Pittsburgh, USA
Francesco Compagno
University of Trento, Trento, Italy
Jonathan Francis
Carnegie Mellon University, Bosch Center for Artificial Intelligence
Multimodal Machine Learning · Robot Learning · Artificial Intelligence · Sensing
A. Oltramari
Bosch Center for Artificial Intelligence; Carnegie Mellon University, Pittsburgh, USA