Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion

📅 2024-12-09
🏛️ 2024 IEEE International Conference on Data Mining Workshops (ICDMW)
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Detecting implicit hate videos in short-video platforms remains challenging due to their subtle semantic cues, weak cross-modal alignments, and critical temporal dynamics. To address this, we propose CMFusion, a novel multimodal fusion model featuring dual-path (channel-level and modality-level) integration. CMFusion jointly models text, audio, and visual modalities via video-audio temporal cross-attention, channel-wise recalibration, and modality-gated fusion. Unlike prevailing unimodal or shallow multimodal approaches, it explicitly captures intra-modal temporal evolution and fine-grained inter-modal interactions. On a real-world dataset, CMFusion consistently outperforms five strong baselines across accuracy, precision, recall, and F1-score. Ablation studies quantitatively validate the contribution of each component. This work establishes an interpretable and robust multimodal paradigm for implicit hate content detection, advancing both methodology and practical deployment.

๐Ÿ“ Abstract
The rapid rise of video content on platforms such as TikTok and YouTube has transformed information dissemination, but it has also facilitated the spread of harmful content, particularly hate videos. Despite significant efforts to combat hate speech, detecting these videos remains challenging due to their often implicit nature. Current detection methods primarily rely on unimodal approaches, which inadequately capture the complementary features across different modalities. While multimodal techniques offer a broader perspective, many fail to effectively integrate temporal dynamics and modality-wise interactions essential for identifying nuanced hate content. In this paper, we present CMFusion, an enhanced multimodal hate video detection model utilizing a novel Channel-wise and Modality-wise Fusion Mechanism. CMFusion first extracts features from text, audio, and video modalities using pre-trained models and then incorporates a temporal cross-attention mechanism to capture dependencies between video and audio streams. The learned features are then processed by channel-wise and modality-wise fusion modules to obtain informative representations of videos. Our extensive experiments on a real-world dataset demonstrate that CMFusion significantly outperforms five widely used baselines in terms of accuracy, precision, recall, and F1 score. Comprehensive ablation studies and parameter analyses further validate our design choices, highlighting the model's effectiveness in detecting hate videos. The source codes will be made publicly available at https://github.com/EvelynZ10/cmfusion.
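The temporal cross-attention step described above can be sketched as follows. This is a minimal illustration only: the class name, dimensions, and head count are assumptions, not CMFusion's published implementation, which may differ in detail.

```python
# Hedged sketch: video frames attend over audio frames to capture the
# video-audio temporal dependencies the abstract describes. All layer
# sizes and names here are illustrative assumptions.
import torch
import torch.nn as nn


class TemporalCrossAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, video_seq: torch.Tensor, audio_seq: torch.Tensor) -> torch.Tensor:
        # video_seq: (batch, T_video, d_model); audio_seq: (batch, T_audio, d_model)
        attended, _ = self.attn(query=video_seq, key=audio_seq, value=audio_seq)
        # Residual connection plus layer norm keeps the original video signal.
        return self.norm(video_seq + attended)


video = torch.randn(2, 16, 256)  # 16 video frame features
audio = torch.randn(2, 32, 256)  # 32 audio frame features
fused = TemporalCrossAttention()(video, audio)
print(fused.shape)  # torch.Size([2, 16, 256])
```

A symmetric audio-to-video attention pass could be added the same way; the output keeps the video sequence length, with each frame enriched by the audio timeline.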
Problem

Research questions and friction points this paper is trying to address.

Detecting implicit hate videos on platforms like TikTok
Improving multimodal fusion for hate video detection
Capturing temporal dynamics in hate video content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Channel-wise and Modality-wise Fusion Mechanism
Temporal cross-attention for video-audio dependencies
Pre-trained models for text, audio, video features
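The two fusion paths named above can be sketched as follows. The squeeze-and-excitation style channel gate and the softmax modality gate are common designs assumed here for illustration; the paper's exact formulation may differ.

```python
# Hedged sketch of channel-wise recalibration and modality-gated fusion.
# Dimensions and module names are illustrative assumptions.
import torch
import torch.nn as nn


class ChannelRecalibration(nn.Module):
    """Reweight each feature channel with a learned sigmoid gate (SE-style)."""
    def __init__(self, dim: int = 256, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim); per-channel scaling factors lie in [0, 1]
        return x * self.gate(x)


class ModalityGatedFusion(nn.Module):
    """Score each modality, softmax the scores, take the weighted sum."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        stacked = torch.stack(feats, dim=1)               # (batch, M, dim)
        weights = torch.softmax(self.score(stacked), 1)   # (batch, M, 1)
        return (weights * stacked).sum(dim=1)             # (batch, dim)


text, audio, video = (torch.randn(2, 256) for _ in range(3))
recal = ChannelRecalibration()
fused = ModalityGatedFusion()([recal(text), recal(audio), recal(video)])
print(fused.shape)  # torch.Size([2, 256])
```

Channel gating sharpens informative dimensions within a modality, while the modality gate lets the model lean on whichever of text, audio, or video carries the hate signal for a given clip.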