🤖 AI Summary
This work addresses the limited interpretability of existing automated multimodal hate speech detection systems, which hinders human–AI collaborative moderation by failing to provide fine-grained evidence such as timestamps and target identities. To overcome this, the task is reframed from binary classification to structured reasoning, introducing a self-constrained cross-modal context mechanism that integrates vision–language and audio–language models. A tandem reinforcement learning strategy enables cross-modal co-optimization without requiring dense frame-level annotations. The proposed approach achieves, for the first time in multimodal hate speech detection, simultaneous high-accuracy target identification and temporal localization. It significantly outperforms zero-shot and context-enhanced baselines across three benchmark datasets, attaining an F1 score of 0.73 for target identification on HateMM, a 30% improvement over the previous best method.
📝 Abstract
Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.
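The abstract describes two models that "optimize each other through self-constrained cross-modal context." The paper's actual reinforcement learning procedure is not specified here, but the general shape of such tandem optimization can be sketched as a toy: two unimodal scorers are updated in alternation, each treating the other's frozen output as a fixed context signal. Everything below (the data, the logistic loss, the scalar scorers, the `alpha` context weight) is a hypothetical stand-in for illustration only, not the TANDEM method itself.

```python
from math import exp

# Made-up samples: (visual_feature, audio_feature, label in {0, 1}).
data = [
    (0.9, 0.8, 1), (0.7, 0.9, 1),
    (-0.9, -0.8, 0), (-0.7, -0.9, 0),
]

wv, wa = 0.0, 0.0      # weights of the vision and audio scorers
lr, alpha = 0.5, 0.3   # step size; weight given to the partner's context

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

for _ in range(200):  # alternating ("tandem") rounds
    # Phase 1: update the vision scorer; the audio scorer is frozen context.
    for v, a, y in data:
        ctx = wa * a                        # frozen cross-modal context
        p = sigmoid(wv * v + alpha * ctx)   # vision prediction given context
        wv -= lr * (p - y) * v              # logistic-loss gradient step
    # Phase 2: update the audio scorer; the vision scorer is frozen context.
    for v, a, y in data:
        ctx = wv * v
        p = sigmoid(wa * a + alpha * ctx)
        wa -= lr * (p - y) * a

def predict(v, a):
    """Fuse the two trained scorers into a joint decision."""
    return sigmoid(wv * v + wa * a)
```

The key property this toy preserves is that neither model is trained in isolation: each phase conditions on the other modality's current output, so the two scorers co-adapt round by round, loosely mirroring the cross-modal co-optimization the summary attributes to the cascaded strategy.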