🤖 AI Summary
Existing hate video detection methods often overlook subtle yet critical hateful segments and struggle to systematically model intra-modal and inter-modal structural relationships, leading to suboptimal multimodal fusion and poor interpretability. To address these limitations, we propose a dual-stream graph neural network (GNN) framework: (1) an instance graph explicitly captures structural dependencies among video segments; (2) a complementary weight graph dynamically models the hateful relevance of each segment. By decoupling instance features from attention weights, our method enables fine-grained hateful segment localization and structured cross-modal representation learning. Integrating GNNs, instance-level segmentation, and attention-based weighting, our approach achieves state-of-the-art performance on mainstream public benchmarks, with significant gains in detection accuracy and strong model interpretability. The source code is publicly available.
📝 Abstract
Hateful videos present serious risks to online safety and real-world well-being, necessitating effective detection methods. Although multimodal classification approaches that integrate information from several modalities outperform unimodal ones, they typically neglect that even minimal hateful content defines a video's category. Specifically, they generally treat all content uniformly instead of emphasizing the hateful components. Additionally, existing multimodal methods cannot systematically capture structured information in videos, limiting the effectiveness of multimodal fusion. To address these limitations, we propose a novel multimodal dual-stream graph neural network model. It constructs an instance graph by separating the given video into several instances and extracting instance-level features. A complementary weight graph then assigns importance weights to these features, highlighting hateful instances. The importance weights and instance features are combined to generate video labels. Our model employs a graph-based framework to systematically model structured relationships within and across modalities. Extensive experiments on public datasets show that our model achieves state-of-the-art performance in hateful video classification and offers strong explainability. Code is available at https://github.com/Multimodal-Intelligence-Lab-MIL/MultiHateGNN.
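The dual-stream idea described above (an instance graph refining per-segment features, a parallel weight graph scoring each segment's hateful relevance, and a weighted pooling that produces the video label) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the graph construction, feature dimensions, message-passing rule, and scoring heads here are all assumptions for exposition.

```python
import numpy as np

def gnn_layer(features, adj, weight):
    # One mean-aggregation message-passing step over the graph,
    # followed by a linear projection and ReLU (illustrative choice).
    deg = adj.sum(axis=1, keepdims=True)
    agg = adj @ features / np.maximum(deg, 1.0)
    return np.maximum(agg @ weight, 0.0)

def classify_video(instance_feats, adj, rng=None):
    """Dual-stream sketch: the instance graph stream refines per-segment
    features, the weight graph stream scores each segment's hateful
    relevance, and the importance-weighted sum yields a video-level score.
    All weight matrices below are random placeholders, not learned."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = instance_feats.shape[1]
    W_inst = rng.standard_normal((d, d)) * 0.1  # instance-stream projection
    W_attn = rng.standard_normal((d, 1)) * 0.1  # weight-stream scorer
    w_cls = rng.standard_normal((d,)) * 0.1     # video-level classifier head

    h = gnn_layer(instance_feats, adj, W_inst)         # instance graph stream
    scores = gnn_layer(instance_feats, adj, W_attn)    # weight graph stream
    alpha = np.exp(scores) / np.exp(scores).sum()      # softmax importance weights

    video_repr = (alpha * h).sum(axis=0)               # weighted pooling
    prob = 1.0 / (1.0 + np.exp(-video_repr @ w_cls))   # sigmoid -> hateful prob.
    return prob, alpha.ravel()
```

Decoupling the two streams is what gives the interpretability claimed above: the returned `alpha` vector localizes which segments drove the decision, independently of the feature representation used to classify.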