Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection

📅 2024-06-17

🏛️ 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Weakly supervised video anomaly detection faces two obstacles: imbalanced information across the audio and visual modalities, and ambiguous boundaries between normal and anomalous content (e.g., violence, nudity). This paper proposes a multi-modal weakly supervised framework that needs no frame-level annotations and trains on video-level labels alone. Its core contributions are: (1) a Cross-modal Fusion Adapter (CFA) that dynamically selects and enhances the audio-visual features most relevant to the visual modality; and (2) a Hyperbolic Lorentzian Graph Attention mechanism (HLGAtt) that captures hierarchical relationships between normal and abnormal representations in hyperbolic space, improving their separation. Experiments on violence and nudity detection benchmarks show state-of-the-art results, supporting the use of hyperbolic geometric modeling and dynamic cross-modal fusion for weakly supervised anomaly detection.
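The summary does not say how video-level labels are turned into a training signal. A common choice in weakly supervised anomaly detection is multiple-instance learning with top-k pooling, and the sketch below assumes that choice purely for illustration; it is not taken from the paper. The names `video_score` and `bce` and the parameter `k` are hypothetical.

```python
import math

def video_score(snippet_scores, k=3):
    """Pool per-snippet anomaly scores into one video-level score.

    Assumption: the mean of the k highest snippet scores stands in for
    the whole video (top-k multiple-instance pooling).
    """
    if not snippet_scores:
        raise ValueError("snippet_scores must not be empty")
    top = sorted(snippet_scores, reverse=True)[:k]
    return sum(top) / len(top)

def bce(pred, label, eps=1e-7):
    """Binary cross-entropy between a pooled score and a video-level label."""
    pred = min(max(pred, eps), 1.0 - eps)  # keep log() finite
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))

# Only the video-level label (1 = anomalous) is needed, never frame labels.
scores = [0.1, 0.9, 0.8, 0.2, 0.7]
loss = bce(video_score(scores, k=3), label=1)
```

Because the loss sees only the pooled score, gradients flow mainly to the snippets the model already ranks highest, which is how a frame-level detector can emerge from video-level supervision.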

📝 Abstract

Recently, weakly supervised video anomaly detection (WS-VAD) has emerged as a research direction for identifying anomalous events such as violence and nudity in videos using only video-level labels. However, the task poses substantial challenges, including handling imbalanced modality information and consistently distinguishing normal from abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. Within the proposed framework, we introduce a new fusion mechanism, the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances highly relevant audio-visual features in relation to the visual modality. Additionally, we introduce Hyperbolic Lorentzian Graph Attention (HLGAtt) to effectively capture the hierarchical relationships between normal and abnormal representations, thereby improving feature separation. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets for violence and nudity detection.
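The abstract names a hyperbolic Lorentzian attention mechanism without giving its equations. As background only, the sketch below shows standard Lorentz-model operations (Lorentzian inner product, exponential map at the origin, geodesic distance) and one simple way to derive attention weights from distances. Unit curvature of -1, the distance-based softmax, and every function name are assumptions for illustration, not the paper's HLGAtt.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map_origin(v):
    """Lift a Euclidean feature v (tangent vector at the origin) onto the
    unit hyperboloid, i.e. the Lorentz model with curvature -1."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm < 1e-12:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in v]

def lorentz_distance(x, y):
    """Geodesic distance; the clamp guards acosh against rounding below 1."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

def distance_attention(query, keys, temperature=1.0):
    """Softmax over negative distances: nearer points get larger weights."""
    scores = [-lorentz_distance(query, k) / temperature for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical snippet features, lifted to the hyperboloid before attention.
features = [[0.1, 0.0], [0.2, 0.1], [1.5, -1.0]]
points = [exp_map_origin(f) for f in features]
weights = distance_attention(points[0], points)
```

Distances in hyperbolic space grow quickly away from the origin, which is why such spaces are often used to embed hierarchical structure like the normal versus abnormal relationships the abstract describes.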

Problem

Research questions and friction points this paper is trying to address.

Weakly Supervised Learning

Video Anomaly Detection

Imbalanced Information

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal Fusion Adapter

Hyperbolic Lorentzian Graph Attention

Imbalanced Information Resolution

🔎 Similar Papers

MTFL: multi-timescale feature learning for weakly-supervised anomaly detection in surveillance videos