🤖 AI Summary
Existing hate video detection methods often overlook subtle yet critical hateful segments and struggle to systematically model intra-modal and inter-modal structural relationships, leading to suboptimal multimodal fusion and poor interpretability. To address these limitations, we propose a dual-stream graph neural network (GNN) framework: (1) an instance graph explicitly captures structural dependencies among video segments; (2) a complementary weight graph dynamically models the hateful relevance of each segment. By decoupling instance features from attention weights, our method enables fine-grained hateful segment localization and structured cross-modal representation learning. Integrating GNNs, instance-level segmentation, and attention-based weighting, our approach achieves state-of-the-art performance on mainstream public benchmarks, with significant gains in detection accuracy and strong model interpretability. The source code is publicly available.
📝 Abstract
Hateful videos present serious risks to online safety and real-world well-being, necessitating effective detection methods. Although multimodal classification approaches that integrate information from several modalities outperform unimodal ones, they typically neglect that even minimal hateful content defines a video's category. Specifically, they generally treat all content uniformly instead of emphasizing the hateful components. Additionally, existing multimodal methods cannot systematically capture structured information in videos, limiting the effectiveness of multimodal fusion. To address these limitations, we propose a novel multimodal dual-stream graph neural network model. It constructs an instance graph by separating the given video into several instances and extracting instance-level features. A complementary weight graph then assigns importance weights to these features, highlighting hateful instances. The importance weights and instance features are combined to generate video labels. Our model employs a graph-based framework to systematically model structured relationships within and across modalities. Extensive experiments on public datasets show that our model achieves state-of-the-art performance in hateful video classification and offers strong explainability. Code is available at https://github.com/Multimodal-Intelligence-Lab-MIL/MultiHateGNN.
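The dual-stream idea described above (an instance graph refining per-segment features, a parallel weight graph scoring each segment's hateful relevance, and a weighted pooling that produces the video label) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the graph construction, feature dimensions, message-passing rule, and scoring heads here are all assumptions for exposition.

```python
import numpy as np

def gnn_layer(features, adj, weight):
    # One mean-aggregation message-passing step over the graph,
    # followed by a linear projection and ReLU (illustrative choice).
    deg = adj.sum(axis=1, keepdims=True)
    agg = adj @ features / np.maximum(deg, 1.0)
    return np.maximum(agg @ weight, 0.0)

def classify_video(instance_feats, adj, rng=None):
    """Dual-stream sketch: the instance graph stream refines per-segment
    features, the weight graph stream scores each segment's hateful
    relevance, and the importance-weighted sum yields a video-level score.
    All weight matrices below are random placeholders, not learned."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = instance_feats.shape[1]
    W_inst = rng.standard_normal((d, d)) * 0.1  # instance-stream projection
    W_attn = rng.standard_normal((d, 1)) * 0.1  # weight-stream scorer
    w_cls = rng.standard_normal((d,)) * 0.1     # video-level classifier head

    h = gnn_layer(instance_feats, adj, W_inst)         # instance graph stream
    scores = gnn_layer(instance_feats, adj, W_attn)    # weight graph stream
    alpha = np.exp(scores) / np.exp(scores).sum()      # softmax importance weights

    video_repr = (alpha * h).sum(axis=0)               # weighted pooling
    prob = 1.0 / (1.0 + np.exp(-video_repr @ w_cls))   # sigmoid -> hateful prob.
    return prob, alpha.ravel()
```

Decoupling the two streams is what gives the interpretability claimed above: the returned `alpha` vector localizes which segments drove the decision, independently of the feature representation used to classify.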