Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

📅 2026-02-11
🤖 AI Summary
This work addresses three limitations of existing weakly supervised multimodal video anomaly detection methods: they under-use the semantic information in the text modality, general-purpose language models miss anomaly-specific nuances, and multimodal fusion suffers from redundancy and imbalance. To overcome these challenges, we propose a text-guided weakly supervised detection framework. It uses in-context learning to generate high-quality anomaly text samples for fine-tuning the text feature extractor, and it introduces a multi-scale bottleneck Transformer that fuses modalities through compressed bottleneck tokens. This approach improves the text representation while mitigating modality redundancy and imbalance. Extensive experiments demonstrate state-of-the-art performance on the UCF-Crime and XD-Violence datasets, with lower false positive rates and higher detection accuracy.

📝 Abstract
Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.
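
The bottleneck-token idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the function names (`attend`, `bottleneck_fusion_step`), the unprojected single-head attention, the averaging of per-modality summaries, and the placeholder modalities are all assumptions made for illustration. The multi-scale aspect is not modeled here. The sketch only shows the core constraint, namely that modalities exchange information solely through a small set of shared tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: no learned projections, to keep the sketch short.
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values

def bottleneck_fusion_step(modalities, bottleneck):
    """One fusion step routed through shared bottleneck tokens (hypothetical).

    modalities: list of (T_m, d) arrays, one per modality
    bottleneck: (B, d) array with B much smaller than any T_m
    """
    # 1) Bottleneck tokens gather a compressed summary from each modality.
    gathered = [attend(bottleneck, m) for m in modalities]
    # Averaging gives every modality equal weight in the shared summary.
    new_bottleneck = bottleneck + np.mean(gathered, axis=0)
    # 2) Each modality reads back only from the bottleneck, never
    #    directly from another modality's full token sequence.
    new_modalities = [m + attend(m, new_bottleneck) for m in modalities]
    return new_modalities, new_bottleneck

# Usage with placeholder features (random values, illustrative sizes).
rng = np.random.default_rng(0)
video = rng.normal(size=(16, 32))
text = rng.normal(size=(16, 32))
bottleneck = rng.normal(size=(4, 32))

mods = [video, text]
for _ in range(2):  # repeated steps integrate information progressively
    mods, bottleneck = bottleneck_fusion_step(mods, bottleneck)

print([m.shape for m in mods], bottleneck.shape)
```

Because each modality sees the others only through the few bottleneck tokens, the cross-modal channel is compact by construction, which is one plausible reading of how such a design limits redundancy.
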
Problem

Research questions and friction points this paper is trying to address.

weakly supervised
multimodal video anomaly detection
text guidance
text feature extraction
multimodal fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

text-guided
in-context learning
multi-stage text augmentation
bottleneck Transformer
multimodal fusion
Authors
Shengyang Sun
School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
Jiashen Hua
Alibaba Cloud, Hangzhou, China
Junyi Feng
Zhejiang University
Computer Vision, Machine Learning, Segmentation
Xiaojin Gong
Zhejiang University
Computer Vision, Image Processing, Artificial Intelligence