🤖 AI Summary
Detecting implicit hate speech in multimodal videos remains challenging because it requires modeling intricate cross-modal semantic relationships and capturing subtle hateful intent. To address this, we propose a reasoning-aware multimodal fusion framework. Our method combines Local-Global Context Fusion (LGCF) with Semantic Cross-Attention (SCA) and introduces a three-stage adversarial reasoning mechanism that generates multi-perspective semantic representations, strengthening contextual understanding and the identification of implicit hateful intent. Built on vision-language foundation models, our approach improves over state-of-the-art methods on two real-world benchmarks by 3% in Macro-F1 and 7% in hate-class recall. These results indicate finer-grained discrimination and stronger cross-scenario generalization.
📝 Abstract
Hate speech in online videos poses an increasingly serious threat to digital platforms, especially as video content becomes more multimodal and context-dependent. Existing methods often struggle to fuse the complex semantic relationships between modalities and to understand nuanced hateful content. To address these issues, we propose the Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences. These outputs provide complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. We will release the code after the anonymity period ends.
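The three-stage adversarial reasoning process described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `adversarial_reasoning` function, and the `vlm(video_ref, prompt)` callable are all hypothetical, and conditioning stages (ii) and (iii) on the stage (i) description is an assumption the abstract does not confirm.

```python
from typing import Callable, Dict

# Hypothetical prompt wording; the abstract names the three stages only.
STAGE_PROMPTS = {
    "objective_description": (
        "Describe what is shown and said in this video objectively, "
        "without judging intent."
    ),
    "hate_assumed": (
        "Assume the video is hateful. Explain what makes it hateful, "
        "given this description: {description}"
    ),
    "non_hate_assumed": (
        "Assume the video is not hateful. Explain why it is benign, "
        "given this description: {description}"
    ),
}


def adversarial_reasoning(
    video_ref: str, vlm: Callable[[str, str], str]
) -> Dict[str, str]:
    """Run the three stages in order and return one text per stage.

    `vlm(video_ref, prompt)` is a placeholder for any vision-language
    model call; it is not an API from the paper or from any library.
    """
    # Stage (i): neutral description of the video content.
    description = vlm(video_ref, STAGE_PROMPTS["objective_description"])
    outputs = {"objective_description": description}

    # Stages (ii) and (iii): two opposing inferences.
    # Assumption: both are conditioned on the stage (i) description.
    for stage in ("hate_assumed", "non_hate_assumed"):
        prompt = STAGE_PROMPTS[stage].format(description=description)
        outputs[stage] = vlm(video_ref, prompt)
    return outputs
```

In a pipeline like the one the abstract outlines, the three returned texts would presumably be encoded and fused with the video's own modalities. How RAMF does this is not specified here.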