🤖 AI Summary
Addressing the challenging problem of long-horizon, multimodal, and rare-event modeling for urban traffic accident risk prediction, this paper introduces the first adaptive long-context multimodal foundation model. Methodologically, we propose a volatility-driven dynamic window selection mechanism and an architecture that integrates shallow cross-attention fusion, local graph attention networks (GATs), and a sparse global BigBird Transformer. We represent spatiotemporal structure via H3 hexagonal tiling and improve calibration and generalization using Monte Carlo dropout and a class-weighted loss. In cross-city experiments across 15 U.S. cities, our model achieves 0.94 accuracy, 0.92 F1-score, and an expected calibration error (ECE) of only 0.04, significantly outperforming more than 20 state-of-the-art baselines. To our knowledge, this is the first approach to combine high accuracy, strong calibration, and robust transferability for long-term traffic accident risk modeling.
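The volatility-driven dynamic window selection described above can be sketched as a simple rule: compute a volatility pre-score over recent observations and map it to one of several candidate context lengths. The function names, thresholds, and window sizes below are illustrative assumptions for a minimal sketch, not the paper's actual values.

```python
import statistics

# Candidate context-window lengths (e.g. hours of history), shortest to longest.
# These sizes are illustrative assumptions, not values from the paper.
CANDIDATE_WINDOWS = [6, 12, 24]

def volatility_prescore(series):
    """Population standard deviation of recent observations as a simple volatility proxy."""
    return statistics.pstdev(series)

def select_window(series, thresholds=(0.5, 1.5)):
    """Map the volatility pre-score to a context length:
    calm signals get a short window, volatile signals a long one."""
    v = volatility_prescore(series)
    if v < thresholds[0]:
        return CANDIDATE_WINDOWS[0]
    if v < thresholds[1]:
        return CANDIDATE_WINDOWS[1]
    return CANDIDATE_WINDOWS[2]

calm = [1.0, 1.1, 0.9, 1.0]    # low variance -> short window
spiky = [0.0, 5.0, 0.0, 5.0]   # high variance -> long window
print(select_window(calm))     # -> 6
print(select_window(spiky))    # -> 24
```

The adaptive step is cheap because the pre-score is computed once per region before the expensive attention stack runs, so short-context inputs never pay the cost of a long window.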
📝 Abstract
Traffic accidents are rare yet high-impact events that require long-context multimodal reasoning for accurate risk forecasting. In this paper, we introduce ALCo-FM, a unified adaptive long-context foundation model that computes a volatility pre-score to dynamically select context windows for input data, then encodes and fuses these multimodal inputs via shallow cross-attention. A local GAT layer and a BigBird-style sparse global transformer over H3 hexagonal grids, coupled with Monte Carlo dropout for confidence estimation, yield superior, well-calibrated predictions. Trained on data from 15 US cities with a class-weighted loss to counter label imbalance, and fine-tuned with minimal data on held-out cities, ALCo-FM achieves 0.94 accuracy, 0.92 F1, and an ECE of 0.04, outperforming more than 20 state-of-the-art baselines in large-scale urban risk prediction. Code and dataset are available at: https://github.com/PinakiPrasad12/ALCo-FM
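Monte Carlo dropout, mentioned above as the confidence mechanism, keeps dropout active at inference time and averages many stochastic forward passes; the spread of those predictions serves as an uncertainty estimate. The toy one-neuron model below is a self-contained hypothetical sketch of that idea, not the paper's network or hyperparameters.

```python
import math
import random
import statistics

def stochastic_forward(x, weights, p_drop=0.2, rng=random):
    """One forward pass of a toy linear+sigmoid model with inverted dropout kept ON."""
    s = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() >= p_drop:           # keep this weight with prob 1 - p_drop
            s += xi * wi / (1.0 - p_drop)    # rescale so the expected activation is unchanged
    return 1.0 / (1.0 + math.exp(-s))        # sigmoid -> accident-risk probability

def mc_dropout_predict(x, weights, n_samples=200, seed=0):
    """Mean prediction plus predictive spread over n stochastic passes."""
    rng = random.Random(seed)
    preds = [stochastic_forward(x, weights, rng=rng) for _ in range(n_samples)]
    return statistics.mean(preds), statistics.pstdev(preds)

mean_p, spread = mc_dropout_predict([1.0, 0.5, -0.3], [0.8, -0.2, 0.5])
print(f"risk={mean_p:.3f} +/- {spread:.3f}")
```

A wide spread flags low-confidence predictions, which is what allows the calibration metric (ECE) reported above to be measured and optimized rather than assumed.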