Enhancing Weakly Supervised Multimodal Video Anomaly Detection through Text Guidance

📅 2026-02-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three limitations of existing weakly supervised multimodal video anomaly detection methods: they underuse the semantic information in the text modality, they rely on general-purpose language models that miss fine-grained anomaly cues, and their multimodal fusion suffers from redundancy and imbalance. The authors propose a text-guided weakly supervised detection framework with two components. An in-context learning mechanism improves the semantic quality of weakly labeled text, and a multi-scale bottleneck Transformer performs compact cross-modal fusion. Together these strengthen the text representation while mitigating modality redundancy and imbalance. Experiments on the UCF-Crime and XD-Violence datasets show state-of-the-art performance, with lower false positive rates and higher detection accuracy.
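To make the in-context learning idea concrete, the following is a minimal sketch of how few-shot prompts could be assembled from weak (video-level) labels and reused across several augmentation rounds. It is not the authors' pipeline: the `Demo` type, the prompt wording, the `generate` callable (a stand-in for any language model call), and the rule that each round feeds its outputs back as demonstrations are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Demo:
    label: str        # weak, video-level label, e.g. "Robbery"
    description: str  # example text describing that anomaly


def build_prompt(demos: List[Demo], target_label: str) -> str:
    """Assemble a few-shot prompt: instruction, demonstrations, then the query."""
    lines = ["Write a short description of the anomaly for the given label.", ""]
    for d in demos:
        lines.append(f"Label: {d.label}")
        lines.append(f"Description: {d.description}")
        lines.append("")
    lines.append(f"Label: {target_label}")
    lines.append("Description:")
    return "\n".join(lines)


def augment(labels: List[str], seed_demos: List[Demo],
            generate: Callable[[str], str],
            stages: int = 2, max_demos: int = 4) -> List[Demo]:
    """Hypothetical multi-stage loop: outputs of one stage become
    demonstrations for the next stage."""
    pool = list(seed_demos)
    generated: List[Demo] = []
    for _ in range(stages):
        new = []
        for label in labels:
            # Use only the most recent demonstrations to keep prompts short.
            prompt = build_prompt(pool[-max_demos:], label)
            text = generate(prompt).strip()
            if text:  # drop empty generations
                new.append(Demo(label, text))
        pool.extend(new)
        generated.extend(new)
    return generated
```

In this sketch the generated descriptions would serve as fine-tuning data for a text feature extractor; a real pipeline would likely also filter generations for quality, which is omitted here.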

📝 Abstract

Weakly supervised multimodal video anomaly detection has gained significant attention, yet the potential of the text modality remains under-explored. Text provides explicit semantic information that can enhance anomaly characterization and reduce false alarms. However, extracting effective text features is challenging due to the inability of general-purpose language models to capture anomaly-specific nuances and the scarcity of relevant descriptions. Furthermore, multimodal fusion often suffers from redundancy and imbalance. To address these issues, we propose a novel text-guided framework. First, we introduce an in-context learning-based multi-stage text augmentation mechanism to generate high-quality anomaly text samples for fine-tuning the text feature extractor. Second, we design a multi-scale bottleneck Transformer fusion module that uses compressed bottleneck tokens to progressively integrate information across modalities, mitigating redundancy and imbalance. Experiments on UCF-Crime and XD-Violence demonstrate state-of-the-art performance.
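The bottleneck fusion idea in the abstract can be illustrated with a small NumPy sketch. A handful of bottleneck tokens first attend to each modality to collect information, then each modality attends back to the bottleneck, so modalities exchange information only through that compressed channel. This is a simplified illustration under stated assumptions, not the paper's module: attention has a single head and no learned projections, the bottleneck tokens are random stand-ins for learned parameters, and reading "multi-scale" as a sequence of shrinking bottleneck sizes (`scales`) is a guess.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values


def bottleneck_fusion_layer(modalities, bottleneck):
    """One fusion step routed entirely through the bottleneck tokens."""
    # 1) Collect: bottleneck tokens read from every modality in turn.
    for tokens in modalities:
        bottleneck = bottleneck + attend(bottleneck, tokens)
    # 2) Broadcast: each modality reads back from the shared bottleneck.
    updated = [tokens + attend(tokens, bottleneck) for tokens in modalities]
    return updated, bottleneck


def multi_scale_fusion(modalities, dim, scales=(8, 4, 2), seed=0):
    """Apply fusion layers with progressively fewer bottleneck tokens
    (hypothetical reading of 'multi-scale')."""
    rng = np.random.default_rng(seed)
    bottleneck = None
    for n in scales:
        # Random initialisation stands in for learned bottleneck tokens.
        bottleneck = rng.standard_normal((n, dim)) * 0.02
        modalities, bottleneck = bottleneck_fusion_layer(modalities, bottleneck)
    return modalities, bottleneck


rng = np.random.default_rng(1)
video = rng.standard_normal((16, 32))  # 16 snippets, 32-dim features
audio = rng.standard_normal((16, 32))
text = rng.standard_normal((5, 32))    # 5 text tokens
fused, bottleneck = multi_scale_fusion([video, audio, text], dim=32)
```

Because cross-modal attention runs between each modality and a few bottleneck tokens instead of between all token pairs, the cost scales with the bottleneck size, and no single modality can dominate the exchange by sheer token count.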

Problem

Research questions and friction points this paper is trying to address.

weakly supervised

multimodal video anomaly detection

text guidance

text feature extraction

multimodal fusion

Innovation

Methods, ideas, or system contributions that make the work stand out.

text-guided

in-context learning

multi-stage text augmentation