Towards Training-free Multimodal Hate Localisation with Large Language Models

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
The proliferation of hate content in online videos poses a serious threat to individual well-being and social harmony. Existing approaches either rely heavily on extensive manual annotations or lack fine-grained temporal localization capabilities. To address these limitations, this work proposes LELA, a novel framework that, for the first time, enables scalable and interpretable hate content detection and temporal localization without any model training. LELA leverages large language models in conjunction with multimodal captions—encompassing visual, audio, OCR, music, and video contextual information—through a multi-stage prompting strategy and a cross-modal composition-matching mechanism. Evaluated on the HateMM and MultiHateClip benchmarks, LELA significantly outperforms existing training-free baselines, demonstrating its effectiveness and robustness.
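As a rough illustration of how such a training-free pipeline could operate, the sketch below scores a single frame by prompting an LLM once per modality caption and then fusing the answers with a second prompt. The `llm` helper, the prompt wording, and the 0–1 scoring scale are all assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a training-free, multi-stage prompting pass over
# one frame's modality captions. `llm(prompt) -> str` stands in for any
# chat-completion call; all prompts and names here are illustrative.

MODALITIES = ["image", "speech", "ocr", "music", "video_context"]

def score_frame(llm, captions: dict[str, str]) -> float:
    """Stage 1: score each modality caption; Stage 2: fuse the scores."""
    per_modality = {}
    for mod in MODALITIES:
        prompt = (
            f"Rate the following {mod} description of a video frame for "
            f"hatefulness, from 0 (benign) to 1 (hateful):\n"
            f"{captions.get(mod, 'N/A')}\nAnswer with a single number."
        )
        try:
            per_modality[mod] = float(llm(prompt).strip())
        except ValueError:
            per_modality[mod] = 0.0  # unparseable reply -> treat as benign
    # Stage 2: a fusion prompt lets the LLM reason across modalities.
    fusion = (
        "Given these per-modality hatefulness scores for one frame "
        f"{per_modality}, reply with one overall score in [0, 1]."
    )
    try:
        return float(llm(fusion).strip())
    except ValueError:
        return max(per_modality.values())  # fall back to worst modality
```

Running `score_frame` over every sampled frame yields a hatefulness curve over time, which is what a temporal localization step would then consume.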

📝 Abstract
The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM)-based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content without any training. Our method decomposes a video into five modalities (image, speech, OCR, music, and video context) and uses a multi-stage prompting scheme to compute fine-grained hatefulness scores for each frame. We further introduce a composition-matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.
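Given per-frame scores like those above, temporal localization reduces, at its simplest, to grouping consecutive high-scoring frames into segments. The following is a minimal sketch of that grouping step under an assumed uniform sampling rate and a fixed threshold; the paper's composition-matching mechanism is more involved, so treat this only as the baseline idea.

```python
# Minimal sketch: turn per-frame hatefulness scores into temporal segments,
# assuming frames sampled uniformly at `fps` frames per second and a fixed
# decision threshold. Illustrative only, not the paper's exact mechanism.

def localize(scores: list[float], fps: float = 1.0,
             threshold: float = 0.5) -> list[tuple[float, float]]:
    """Group consecutive frames scoring >= threshold into
    (start_sec, end_sec) hateful segments."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                     # segment opens at this frame
        elif s < threshold and start is not None:
            segments.append((start / fps, i / fps))
            start = None                  # segment closes
    if start is not None:                 # still open at end of video
        segments.append((start / fps, len(scores) / fps))
    return segments

# Example with frames sampled at 1 fps:
print(localize([0.1, 0.7, 0.8, 0.2, 0.9], fps=1.0))
# -> [(1.0, 3.0), (4.0, 5.0)]
```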
Problem

Research questions and friction points this paper is trying to address.

hate localization
multimodal video
training-free
temporal precision
online hate content
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
multimodal hate localization
large language models
cross-modal reasoning
temporal localization
👥 Authors

Yueming Sun
Hybrid Intelligence Lab, University of Durham; Multimodal Intelligence Lab, University of Exeter

Long Yang
Hybrid Intelligence Lab, University of Durham

Jianbo Jiao
University of Birmingham | University of Oxford
Computer Vision · Machine Learning

Zeyu Fu
Lecturer, Department of Computer Science, University of Exeter
Multimedia Computing · Medical Image Analysis · AI4Science