Zoom In, Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference

📅 2026-04-26

🤖 AI Summary
This work targets subtle anomalous behaviors of distant (far-field) vehicles in expressway surveillance video. Conventional methods struggle to identify such behaviors, and feeding full frames to a vision-language model (VLM) disperses its attention and incurs high computational overhead. The proposed framework, VIBES, asynchronously couples online Bayesian inference with a VLM, which the authors present as the first such integration. Bayesian modeling dynamically learns the boundaries of normal driving behavior and, when a trajectory falls outside them, triggers and localizes the anomalous spatiotemporal region; only these focused regions are passed to the VLM for semantic reasoning. The resulting adaptive attention focusing and on-demand inference are reported to improve detection accuracy across diverse expressway scenarios while reducing computational cost, with gains in efficiency, interpretability, and cross-scenario generalization.
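To make the idea of "learning the boundaries of normal driving behavior" concrete, the following is a minimal sketch of an online Bayesian trigger. It is not the VIBES implementation: the summary does not specify the model, so this sketch assumes a single scalar trajectory feature (for example, speed), a Normal-Inverse-Gamma conjugate prior, and a fixed threshold on the standardized deviation from the posterior predictive distribution. The names `OnlineNormalModel`, `detect`, `threshold`, and `warmup` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class OnlineNormalModel:
    """Normal-Inverse-Gamma posterior over the mean and variance of one
    scalar trajectory feature (assumed here; not taken from the paper)."""
    mu: float = 0.0       # posterior mean of the feature
    kappa: float = 1e-3   # pseudo-count backing the mean (weak prior)
    alpha: float = 1.0    # inverse-gamma shape
    beta: float = 1.0     # inverse-gamma scale

    def predictive_scale(self) -> float:
        # Scale of the Student-t posterior predictive distribution.
        return math.sqrt(self.beta * (self.kappa + 1.0) / (self.alpha * self.kappa))

    def score(self, x: float) -> float:
        # Standardized distance of x from what the model currently expects.
        return abs(x - self.mu) / self.predictive_scale()

    def update(self, x: float) -> None:
        # Conjugate update for a single new observation.
        kappa_n = self.kappa + 1.0
        self.beta += 0.5 * self.kappa * (x - self.mu) ** 2 / kappa_n
        self.mu = (self.kappa * self.mu + x) / kappa_n
        self.kappa = kappa_n
        self.alpha += 0.5


def detect(values, threshold=4.0, warmup=10):
    """Return indices whose score exceeds the threshold after a warm-up."""
    model = OnlineNormalModel()
    flagged = []
    for i, x in enumerate(values):
        if i >= warmup and model.score(x) > threshold:
            flagged.append(i)  # trigger; keep the outlier out of the normal model
        else:
            model.update(x)    # absorb normal behavior into the boundary
    return flagged


# Illustrative speeds in km/h (made-up data): steady traffic, one sudden slowdown.
speeds = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101,
          99, 102, 98, 100, 101, 20, 100, 99]
print(detect(speeds))
```

Because the posterior is updated with every normal observation, the boundary adapts to each camera and scene without retraining, which is the property the summary attributes to the Bayesian component. A real system would likely model several features jointly and would pass the flagged time window on to the VLM.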

📝 Abstract
Expressway video anomaly detection is essential for safety management. However, identifying anomalies across diverse scenes remains challenging, particularly for far-field targets exhibiting subtle abnormal vehicle motions. While Vision-Language Models (VLMs) demonstrate strong semantic reasoning capabilities, processing global frames causes attention dilution for these far-field objects and incurs prohibitive computational costs. To address these issues, we propose VIBES, an asynchronous collaborative framework utilizing VLMs guided by Bayesian inference. Specifically, to overcome poor generalization across varying expressway environments, we introduce an online Bayesian inference module. This module continuously evaluates vehicle trajectories to dynamically update the probabilistic boundaries of normal driving behaviors, serving as an asynchronous trigger to precisely localize anomalies in space and time. Instead of processing the continuous video stream, the VLM processes only the localized visual regions indicated by the trigger. This targeted visual input prevents attention dilution and enables accurate semantic reasoning. Extensive evaluations demonstrate that VIBES improves detection accuracy for far-field anomalies and reduces computational overhead, achieving high real-time efficiency and explainability while demonstrating generalization across diverse expressway conditions.
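The "zoom in" step, where the VLM receives only the localized region instead of the full frame, can be sketched as a crop computation around the triggered vehicle. This is an illustration under stated assumptions, not the authors' code: the function name `focus_region`, the `margin` and `min_size` parameters, and the pixel box format `(x1, y1, x2, y2)` are all hypothetical.

```python
def focus_region(box, frame_w, frame_h, margin=0.5, min_size=64):
    """Expand a small far-field bounding box into a crop window.

    box      : (x1, y1, x2, y2) in pixels for the triggered vehicle
    margin   : fraction of the box size added on each side for context
    min_size : smallest crop side, so tiny targets still get usable context
    Returns integer (left, top, right, bottom), clamped to the frame.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0

    # Grow the box by the margin, enforce the minimum size, cap at frame size.
    w = min(max((x2 - x1) * (1.0 + 2.0 * margin), min_size), frame_w)
    h = min(max((y2 - y1) * (1.0 + 2.0 * margin), min_size), frame_h)

    # Center the window on the target, then shift it back inside the frame.
    left = min(max(cx - w / 2.0, 0.0), frame_w - w)
    top = min(max(cy - h / 2.0, 0.0), frame_h - h)
    return int(left), int(top), int(left + w), int(top + h)


# A 16x12 pixel vehicle far down the road in a 1920x1080 frame.
print(focus_region((1000, 200, 1016, 212), 1920, 1080))
```

In this example the crop covers 64x64 pixels out of a 1920x1080 frame, so the target occupies a much larger share of the model's input than it would in the full frame. That is the intuition behind the abstract's claim that targeted input prevents attention dilution and lowers cost.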
Problem

Research questions and friction points this paper is trying to address.

far-field anomaly detection
expressway surveillance
Vision-Language Models
attention dilution
computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bayesian inference
Vision-Language Models
far-field anomaly detection
asynchronous triggering
trajectory-based reasoning

Authors

Xiaowei Mao
Beijing Jiaotong University

Bowen Sui
School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China

Weijie Zhang
University of Kansas Medical Center
Inverse planning, particle therapy

Yawen Yang
Tsinghua University
Deep learning, Natural Language Processing

Shengnan Guo
Beijing Jiaotong University
Spatial-Temporal Data Mining

Shilong Zhao
University of Chinese Academy of Sciences

Jiaqi Lin
School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China

Tingrui Wu
School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China

Youfang Lin
School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China; Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing, China

Huaiyu Wan
School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China; Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, Beijing, China