🤖 AI Summary
Fine-grained temporal localization of multimodal hateful content in online videos remains challenging: existing methods are restricted to video-level classification and fail to model cross-modal temporal dynamics when only video-level weak supervision is available.
Method: We propose MultiHateLoc, the first framework for weakly supervised temporal localization of multimodal hateful content. It integrates a modality-aware temporal encoder, a dynamically weighted cross-modal fusion mechanism, and a cross-modal contrastive alignment strategy, coupled with text-enhanced preprocessing and modality-specific multiple-instance learning (MIL) objectives (see the sketch after this summary).
Contribution/Results: Our approach achieves state-of-the-art temporal localization performance on the HateMM and MultiHateClip benchmarks, produces interpretable frame-level predictions, and significantly outperforms video-level classification baselines. By enabling precise, fine-grained analysis of multimodal hateful content from weak supervision alone, the framework establishes a new paradigm for multimodal content safety assessment.
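To make the MIL idea concrete, the following is a minimal PyTorch sketch of a modality-specific MIL loss trained with video-level labels only; it is not the authors' implementation. The top-k mean pooling, the function name, and the hyperparameter `k` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mil_video_loss(frame_scores: torch.Tensor,
                   video_label: torch.Tensor,
                   k: int = 8) -> torch.Tensor:
    """Weakly supervised MIL loss for one modality (illustrative sketch).

    frame_scores: (B, T) frame-level hate logits from one modality stream.
    video_label:  (B,)   video-level binary labels (1 = hateful).
    The top-k frame logits are mean-pooled into a video-level logit, so
    training needs only video-level labels; the frame logits themselves
    serve as localisation scores at inference time.
    """
    k = min(k, frame_scores.size(1))
    topk_logits, _ = frame_scores.topk(k, dim=1)   # (B, k) most hateful frames
    video_logit = topk_logits.mean(dim=1)          # (B,) pooled video-level evidence
    return F.binary_cross_entropy_with_logits(video_logit, video_label.float())
```

One such term per modality stream (visual, audio, text) would be summed into the training objective; the per-frame scores provide the localisation output.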
📝 Abstract
The rapid growth of video content on platforms such as TikTok and YouTube has intensified the spread of multimodal hate speech, where harmful cues emerge subtly and asynchronously across visual, acoustic, and textual streams. Existing research focuses primarily on video-level classification, leaving the practically crucial task of temporal localisation (identifying when hateful segments occur) largely unaddressed. The challenge is even more pronounced under weak supervision, where only video-level labels are available and static fusion or classification-based architectures struggle to capture cross-modal and temporal dynamics. To address these challenges, we propose MultiHateLoc, the first framework designed for weakly supervised multimodal hate localisation. MultiHateLoc incorporates (1) modality-aware temporal encoders that model heterogeneous sequential patterns, including a tailored text-based preprocessing module for feature enhancement; (2) a dynamic cross-modal fusion mechanism that adaptively emphasises the most informative modality at each moment, together with a cross-modal contrastive alignment strategy that enhances multimodal feature consistency; and (3) a modality-aware multiple-instance learning (MIL) objective that identifies discriminative segments under video-level supervision. Despite relying solely on coarse labels, MultiHateLoc produces fine-grained, interpretable frame-level predictions. Experiments on HateMM and MultiHateClip show that our method achieves state-of-the-art performance on the localisation task.
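As a rough illustration of the dynamic cross-modal fusion described above, here is a minimal sketch that weights visual, audio, and text features per timestep with a softmax over modality gating scores. The class name, gating design, and dimensions are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DynamicCrossModalFusion(nn.Module):
    """Per-timestep weighting of modality features (illustrative sketch).

    Each modality's features (B, T, D) are scored by a small gating head;
    a softmax over modalities at every timestep yields dynamic weights,
    so the most informative stream dominates the fused representation
    at that moment.
    """
    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.gates = nn.ModuleList(
            [nn.Linear(dim, 1) for _ in range(num_modalities)]
        )

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of (B, T, D) tensors, one per modality
        scores = torch.cat([g(f) for g, f in zip(self.gates, feats)], dim=-1)  # (B, T, M)
        weights = scores.softmax(dim=-1)                                       # (B, T, M)
        stacked = torch.stack(feats, dim=-1)                                   # (B, T, D, M)
        return (stacked * weights.unsqueeze(2)).sum(dim=-1)                    # (B, T, D)

# Example: fuse 64-frame visual/audio/text features of width 256.
fusion = DynamicCrossModalFusion(dim=256)
v, a, t = (torch.randn(2, 64, 256) for _ in range(3))
fused = fusion([v, a, t])   # (2, 64, 256)
```

The softmax gating is one simple way to realise per-moment modality emphasis; attention-based or learned-temperature variants would follow the same interface.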