🤖 AI Summary
This work addresses the limitations of existing weakly supervised multimodal video anomaly detection methods, which often fail to fully exploit the semantic potential of textual modalities, struggle with fine-grained anomaly capture using generic language models, and suffer from redundancy and imbalance in multimodal fusion. To overcome these challenges, we propose a text-guided weakly supervised detection framework that enhances the semantic quality of weakly labeled text through in-context learning and introduces a multi-scale bottleneck Transformer for efficient, compact cross-modal fusion. This approach significantly improves textual representation capability while mitigating modality redundancy and imbalance. Extensive experiments demonstrate state-of-the-art performance on the UCF-Crime and XD-Violence datasets, achieving notably lower false positive rates and higher detection accuracy.
📝 Abstract
Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.
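To make the bottleneck idea concrete, here is a minimal NumPy sketch of fusion through a small set of shared tokens. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the architecture, so the function names (`attend`, `bottleneck_fusion`, `multi_scale_fusion`), the token counts per scale (8, 4, 2), the random initialization of the bottleneck tokens, and the use of projection-free single-head attention are all hypothetical. The point it shows is that modalities never attend to each other directly; information passes only through the compressed bottleneck tokens, which is one way to limit redundancy.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, context):
    # Single-head scaled dot-product attention with no learned projections
    # (a simplification; a real Transformer layer would learn Q/K/V weights).
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context

def bottleneck_fusion(modalities, num_bottleneck, rng):
    # Hypothetical fusion step: k bottleneck tokens mediate all exchange.
    d = modalities[0].shape[-1]
    bottleneck = rng.standard_normal((num_bottleneck, d))  # assumed init
    # The bottleneck collects information from each modality in turn.
    for tokens in modalities:
        bottleneck = bottleneck + attend(bottleneck, tokens)
    # Each modality reads back only from the bottleneck, never from
    # another modality's full token sequence.
    updated = [tokens + attend(tokens, bottleneck) for tokens in modalities]
    return updated, bottleneck

def multi_scale_fusion(modalities, scales=(8, 4, 2), seed=0):
    # Assumed reading of "multi-scale": repeat the fusion with
    # progressively fewer bottleneck tokens and keep one summary per scale.
    rng = np.random.default_rng(seed)
    summaries = []
    for k in scales:
        modalities, bottleneck = bottleneck_fusion(modalities, k, rng)
        summaries.append(bottleneck.mean(axis=0))
    return modalities, np.concatenate(summaries)

# Toy features: 32 video snippets, 32 audio snippets, 5 text tokens, dim 16.
data_rng = np.random.default_rng(1)
video = data_rng.standard_normal((32, 16))
audio = data_rng.standard_normal((32, 16))
text = data_rng.standard_normal((5, 16))

fused, summary = multi_scale_fusion([video, audio, text])
print([f.shape for f in fused], summary.shape)
```

Because each exchange is routed through k tokens, the cross-modal attention cost scales with k rather than with the product of the sequence lengths, and a modality with many tokens cannot dominate the shared representation simply by being longer.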