Reasoning-Aware Multimodal Fusion for Hateful Video Detection

📅 2025-12-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Detecting implicit hate speech in multimodal videos remains challenging because it requires modeling intricate cross-modal semantic relationships and recognizing subtle hateful intent. To address this, we propose a reasoning-aware multimodal fusion framework. Our method combines Local-Global Context Fusion (LGCF) with Semantic Cross Attention (SCA), and introduces a three-stage adversarial reasoning mechanism that generates semantic representations from multiple perspectives, improving contextual understanding and the identification of implicit hateful intent. Built upon vision-language foundation models, our approach improves over state-of-the-art methods on two real-world benchmarks: +3% Macro-F1 and +7% recall for the hate class. These results indicate finer-grained discrimination and stronger generalization across datasets.
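To make the local-global idea concrete, here is a minimal sketch in plain Python. It is not the paper's LGCF module, whose internals are not described on this page. It assumes per-frame feature vectors, takes the element-wise maximum over sliding-window means as the "local salient" summary, takes the mean over all frames as the "global" summary, and concatenates the two. The function names and the `window` parameter are illustrative assumptions.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def local_global_fuse(frames: List[Vector], window: int = 2) -> Vector:
    """Concatenate a local summary with a global summary of frame features.

    Assumption (not from the paper): "local" is the element-wise max over
    sliding-window means, and "global" is the mean over all frames.
    """
    if not frames:
        raise ValueError("frames must not be empty")
    window = max(1, min(window, len(frames)))

    # Local branch: average each short window, then keep the strongest response.
    window_means = [
        mean_vector(frames[start:start + window])
        for start in range(len(frames) - window + 1)
    ]
    local_summary = [max(column) for column in zip(*window_means)]

    # Global branch: a single average over the whole sequence.
    global_summary = mean_vector(frames)

    return local_summary + global_summary


if __name__ == "__main__":
    toy_frames = [[0.0, 0.0], [4.0, 0.0], [0.0, 2.0]]
    print(local_global_fuse(toy_frames, window=2))
```

A learned model would typically replace both pooling steps with trainable layers; the fixed pooling here only illustrates how short-range and sequence-level information can sit side by side in one representation.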

📝 Abstract

Hate speech in online videos poses an increasingly serious threat to digital platforms, especially as video content becomes more multimodal and context-dependent. Existing methods often struggle to fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose a Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences. These outputs provide complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.
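The three-stage adversarial reasoning process can be sketched as a small prompting loop. This is an illustration under stated assumptions, not the authors' released code: the prompt wording is invented, the `vlm` argument is a hypothetical callable standing in for any vision-language model, and feeding the stage (i) description into stages (ii) and (iii) is an assumption about how the stages connect.

```python
from typing import Callable, Dict

# Hypothetical prompt templates; the paper's actual prompts are not given here.
PROMPTS: Dict[str, str] = {
    "objective": (
        "Describe objectively what is shown and said in this video. "
        "Do not judge intent."
    ),
    "hate_assumed": (
        "Assume this video is hateful. Using the description below, "
        "explain who is targeted and how.\n\nDescription: {description}"
    ),
    "non_hate_assumed": (
        "Assume this video is not hateful. Using the description below, "
        "explain a benign reading.\n\nDescription: {description}"
    ),
}


def adversarial_reasoning(
    video_ref: str,
    vlm: Callable[[str, str], str],
) -> Dict[str, str]:
    """Run the three stages and return one text output per stage.

    `vlm(video_ref, prompt)` is a placeholder for a real model call.
    """
    # Stage (i): a neutral description with no judgement of intent.
    description = vlm(video_ref, PROMPTS["objective"])
    outputs = {"objective": description}

    # Stages (ii) and (iii): two opposing assumptions about the same content.
    for stage in ("hate_assumed", "non_hate_assumed"):
        prompt = PROMPTS[stage].format(description=description)
        outputs[stage] = vlm(video_ref, prompt)
    return outputs


if __name__ == "__main__":
    def echo_vlm(video_ref: str, prompt: str) -> str:
        # Stand-in model that just reports what it was asked.
        return f"[{video_ref}] {prompt.splitlines()[0]}"

    for stage, text in adversarial_reasoning("clip_001.mp4", echo_vlm).items():
        print(stage, "->", text)
```

In a full pipeline, the three texts would presumably be encoded and passed to the fusion model as additional inputs alongside the video's own modalities.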

Problem

Research questions and friction points this paper is trying to address.

Detect hate speech in multimodal online videos

Fuse complex semantic relationships between video modalities

Understand nuanced hateful content through adversarial reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Local-Global Context Fusion captures both local and global cues

Semantic Cross Attention enables fine-grained multimodal interaction

Adversarial reasoning uses three-stage inferences to enrich understanding
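As a rough illustration of the cross-attention operation that a module like SCA builds on, the sketch below implements plain scaled dot-product cross-attention in pure Python: tokens from one modality act as queries over the tokens of another. It omits the learned projections, multiple heads, and any semantic guidance the paper's SCA may use, so treat it as the generic mechanism rather than the authors' design.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention from one modality onto another.

    queries: tokens of the attending modality (e.g. text), shape [n_q][d]
    keys, values: tokens of the attended modality (e.g. video frames),
                  shapes [n_k][d] and [n_k][d_v]
    """
    dim = len(queries[0])
    value_dim = len(values[0])
    attended = []
    for query in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(value_dim)
        ])
    return attended


if __name__ == "__main__":
    text_tokens = [[1.0, 0.0], [0.0, 1.0]]
    frame_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(cross_attention(text_tokens, frame_tokens, frame_tokens))
```

Each output row is a mixture of the attended modality's tokens, weighted by how well they match the corresponding query token, which is what allows token-level interaction between modalities.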

🔎 Similar Papers

HateSieve: A Contrastive Learning Framework for Detecting and Segmenting Hateful Content in Multimodal Memes