🤖 AI Summary
To address the low accuracy, high computational overhead, and poor interpretability of real-time traffic accident prediction in autonomous driving, this paper proposes a domain-enhanced dual-branch multimodal fusion model. One branch processes driving videos with Long-CLIP; the other processes structured accident texts, semantically parsed through GPT-4o prompt engineering. Cross-modal feature alignment and a lightweight fusion mechanism combine the two branches. The model also incorporates prompt templates constrained by traffic-domain knowledge and a hierarchical attention aggregation strategy, which reduce inference latency while preserving interpretability. Evaluated on three benchmarks (DAD, CCD, and A3D), the method achieves state-of-the-art performance with fewer parameters, improving average accuracy by 4.2% and F1-score by 5.7%. The work targets real-time, trustworthy accident early-warning systems.
📝 Abstract
Developing precise and computationally efficient traffic accident anticipation systems is crucial for contemporary autonomous driving technologies, enabling timely intervention and loss prevention. In this paper, we propose an accident anticipation framework with a dual-branch architecture that integrates visual information from dashcam videos with structured textual data derived from accident reports. We also introduce a feature aggregation method that integrates the multimodal inputs through large models (GPT-4o, Long-CLIP), complemented by targeted prompt engineering strategies that produce actionable feedback and standardized accident archives. Comprehensive evaluations on benchmark datasets (DAD, CCD, and A3D) validate the superior predictive accuracy, enhanced responsiveness, reduced computational overhead, and improved interpretability of our approach, establishing a new state of the art in traffic accident anticipation.
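To make the dual-branch idea concrete, the following is a minimal sketch of how per-frame visual features might be fused with textual features to produce an accident score for each frame. It is an illustration under stated assumptions, not the paper's method: the abstract does not specify the aggregation mechanism, so generic scaled dot-product attention followed by a linear scoring layer stands in for it. The feature vectors are toy values rather than Long-CLIP or GPT-4o outputs, and the names `attend`, `anticipate`, `frame_feats`, `text_feats`, `w`, and `b` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys):
    """Scaled dot-product attention of one query over the keys.

    The keys double as values, so the result is a weighted average
    of the text features most similar to the frame feature.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(keys[0])
    context = [sum(wt * k[d] for wt, k in zip(weights, keys)) for d in range(dim)]
    return context, weights


def anticipate(frame_feats, text_feats, w, b):
    """Return one accident score in (0, 1) per video frame.

    Assumption: fusion is concatenation of the frame feature with its
    attended text context, scored by a linear layer and a sigmoid.
    """
    scores = []
    for frame in frame_feats:
        context, _ = attend(frame, text_feats)
        fused = frame + context  # list concatenation: [visual | textual]
        logit = dot(w, fused) + b
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Toy inputs: three frames and two text embeddings, all 2-dimensional.
frames = [[0.1, 0.0], [0.4, 0.2], [0.9, 0.8]]
texts = [[1.0, 1.0], [0.0, 1.0]]
weights = [1.0, 1.0, 0.5, 0.5]  # hypothetical untrained parameters
scores = anticipate(frames, texts, weights, -1.0)
```

In an actual system the parameters would be learned and the features would come from the visual and textual encoders. The sketch only shows the data flow: each frame queries the text branch, and the fused vector yields a per-frame score that could be thresholded for early warning.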