MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation

📅 2025-09-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address two key challenges in traffic accident prediction from dashcam views—frequent occlusion of road users and asynchronous, complex behavioral cues—this paper proposes a Multi-Scale Feature Interaction Network (MSFIN). MSFIN jointly models short-, medium-, and long-term spatiotemporal dependencies and implicit interactions under occlusion via hierarchical feature aggregation, causally constrained temporal evolution modeling, and cross-scale Transformer-based interaction. It introduces an integrated architecture unifying multi-scale feature extraction, interaction, and post-fusion, enabling joint risk representation learning from both scene-level and object-level features. Evaluated on the DAD and DADA benchmarks, MSFIN achieves state-of-the-art performance: it significantly outperforms existing single-scale methods in both prediction accuracy and mean early-warning lead time (+1.2 seconds), demonstrating the effectiveness and practicality of multi-scale asynchronous interaction modeling for dashcam-based accident prediction.

Technology Category

Application Category

📝 Abstract
With the widespread deployment of dashcams and advancements in computer vision, developing accident prediction models from the dashcam perspective has become critical for proactive safety interventions. However, two key challenges persist: modeling feature-level interactions among traffic participants (often occluded in dashcam views) and capturing complex, asynchronous multi-temporal behavioral cues preceding accidents. To deal with these two challenges, a Multi-scale Feature Interaction Network (MsFIN) is proposed for early-stage accident anticipation from dashcam videos. MsFIN has three layers for multi-scale feature aggregation, temporal feature processing and multi-scale feature post fusion, respectively. For multi-scale feature aggregation, a Multi-scale Module is designed to extract scene representations at short-term, mid-term and long-term temporal scales. Meanwhile, the Transformer architecture is leveraged to facilitate comprehensive feature interactions. Temporal feature processing captures the sequential evolution of scene and object features under causal constraints. In the multi-scale feature post fusion stage, the network fuses scene and object features across multiple temporal scales to generate a comprehensive risk representation. Experiments on DAD and DADA datasets show that MsFIN significantly outperforms state-of-the-art models with single-scale feature extraction in both prediction correctness and earliness. Ablation studies validate the effectiveness of each module in MsFIN, highlighting how the network achieves superior performance through multi-scale feature fusion and contextual interaction modeling.
Problem

Research questions and friction points this paper is trying to address.

Modeling feature-level interactions among occluded traffic participants
Capturing complex asynchronous multi-temporal behavioral cues
Improving early accident anticipation from dashcam video perspectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale temporal feature aggregation at short/mid/long-term scales
Transformer architecture for comprehensive feature interactions modeling
Multi-scale post fusion of scene and object features
🔎 Similar Papers
No similar papers found.
T
Tongshuai Wu
School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
C
Chao Lu
School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
Z
Ze Song
School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
Y
Yunlong Lin
School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
S
Sizhe Fan
School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China
Xuemei Chen
Xuemei Chen
University of North Carolina Wilmington