🤖 AI Summary
This work addresses the detection of subtle anomalous behavior by distant (far-field) vehicles in expressway surveillance video. Conventional methods struggle to identify such anomalies, and feeding entire frames to a vision-language model (VLM) dilutes attention on small far-field targets while incurring high computational cost. To overcome these limitations, we propose VIBES, a framework in which online Bayesian inference asynchronously guides a VLM. The Bayesian module learns the boundaries of normal driving behavior from vehicle trajectories as they are observed, triggers when a trajectory falls outside those boundaries, and localizes the anomaly in space and time. Only the localized regions are passed to the VLM for semantic reasoning. This focuses attention where it is needed and runs VLM inference on demand, improving detection accuracy for far-field anomalies across diverse expressway scenes while reducing computational cost, with gains in efficiency, interpretability, and cross-scenario generalization.
📝 Abstract
Expressway video anomaly detection is essential for safety management. However, identifying anomalies across diverse scenes remains challenging, particularly for far-field targets exhibiting subtle abnormal vehicle motions. While Vision-Language Models (VLMs) demonstrate strong semantic reasoning capabilities, processing global frames causes attention dilution for these far-field objects and incurs prohibitive computational costs. To address these issues, we propose VIBES, an asynchronous collaborative framework utilizing VLMs guided by Bayesian inference. Specifically, to overcome poor generalization across varying expressway environments, we introduce an online Bayesian inference module. This module continuously evaluates vehicle trajectories to dynamically update the probabilistic boundaries of normal driving behaviors, serving as an asynchronous trigger to precisely localize anomalies in space and time. Instead of processing the continuous video stream, the VLM processes only the localized visual regions indicated by the trigger. This targeted visual input prevents attention dilution and enables accurate semantic reasoning. Extensive evaluations demonstrate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead, achieving high real-time efficiency and explainability while demonstrating generalization across diverse expressway conditions.
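The trigger mechanism described above can be illustrated with a minimal sketch. The abstract does not specify the probabilistic model, so everything here is an assumption made for illustration: a single scalar trajectory feature (for example, lateral speed) is modeled as Gaussian with a Normal-Inverse-Gamma prior, the "boundary of normal behavior" is a threshold on the standardized posterior-predictive residual, and the names `NormalBoundary`, `run_trigger`, `threshold`, `warmup`, and `pad` are hypothetical rather than taken from VIBES.

```python
import math
from dataclasses import dataclass


@dataclass
class NormalBoundary:
    """Online Normal-Inverse-Gamma posterior over one scalar trajectory feature.

    Illustrative stand-in for an online Bayesian module; not the VIBES model.
    """
    mu: float = 0.0     # posterior mean of the feature
    kappa: float = 1.0  # pseudo-count supporting the mean
    alpha: float = 1.0  # shape of the variance posterior
    beta: float = 1.0   # scale of the variance posterior

    def predictive_scale(self) -> float:
        # Scale of the Student-t posterior predictive distribution.
        return math.sqrt(self.beta * (self.kappa + 1.0) / (self.alpha * self.kappa))

    def score(self, x: float) -> float:
        # Standardized distance of x from the current notion of "normal".
        return abs(x - self.mu) / self.predictive_scale()

    def update(self, x: float) -> None:
        # Standard conjugate update for a single observation.
        kappa_n = self.kappa + 1.0
        self.beta += self.kappa * (x - self.mu) ** 2 / (2.0 * kappa_n)
        self.mu = (self.kappa * self.mu + x) / kappa_n
        self.kappa = kappa_n
        self.alpha += 0.5


def run_trigger(samples, threshold=4.0, warmup=10, pad=32.0):
    """Return (time, crop_box) for each observation flagged as anomalous.

    samples: iterable of (t, cx, cy, feature), where (cx, cy) is the vehicle's
    image position. crop_box is the region a VLM would be asked to inspect.
    """
    model = NormalBoundary()
    seen = 0
    triggers = []
    for t, cx, cy, feature in samples:
        if seen >= warmup and model.score(feature) > threshold:
            # Anomalous: emit a localized region and do not learn from it,
            # so outliers do not widen the boundary of normal behavior.
            triggers.append((t, (cx - pad, cy - pad, cx + pad, cy + pad)))
        else:
            model.update(feature)
            seen += 1
    return triggers
```

In this sketch, `run_trigger` would be fed per-frame trajectory samples from a tracker, and only the returned crop boxes (with their timestamps) would be handed to the VLM. Skipping the update on flagged samples is one possible design choice; a real system might instead down-weight them or maintain separate models per lane or scene.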