🤖 AI Summary
Addressing the zero-shot challenge of detecting hate speech in deepfake audio for low-resource languages, this paper proposes a multimodal cross-lingual joint alignment framework. Its core innovation is what the authors describe as the first cross-lingual audio-text contrastive learning mechanism, which constructs a shared semantic space to enable robust detection without labeled data in the target language. To support this work, the authors release the first benchmark dataset of its kind, comprising 127,290 deepfake audio-text pairs across six languages, five of them low-resource Indian languages. The method integrates contrastive learning, multimodal embedding alignment, and cross-lingual transfer; a minimal sketch of the contrastive objective follows below. Evaluated on two multilingual test sets, it achieves accuracies of 0.819 and 0.701, significantly outperforming unimodal baselines and demonstrating strong generalization to unseen languages.
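To make the alignment idea concrete, here is a minimal sketch of a symmetric audio-text contrastive objective (InfoNCE-style, in the spirit of CLIP-like training). The encoder outputs, embedding dimension, and temperature value are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: symmetric audio-text contrastive alignment.
# Assumes audio and text encoders already project into a shared space;
# dimensions and the temperature are illustrative, not the paper's values.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Pull paired audio/text embeddings together, push mismatched pairs apart.

    audio_emb, text_emb: (batch, dim) projections into the shared semantic space.
    """
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(audio_emb.size(0))        # matched pairs sit on the diagonal
    # Symmetric cross-entropy: align audio->text and text->audio.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return (loss_a2t + loss_t2a) / 2

# Toy usage: a batch of 8 paired embeddings in a hypothetical 256-d shared space.
audio = torch.randn(8, 256)
text = torch.randn(8, 256)
print(contrastive_alignment_loss(audio, text).item())
```

Because the loss is symmetric across modalities and language-agnostic at the embedding level, paired samples from any language can be mixed in the same batch, which is what enables transfer to languages unseen during training.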
📝 Abstract
This paper introduces a novel multimodal framework for hate speech detection in deepfake audio that performs well even in zero-shot scenarios. Unlike previous approaches, our method uses contrastive learning to jointly align audio and text representations across languages. We present the first benchmark dataset with 127,290 paired text and synthesized speech samples in six languages: English and five low-resource Indian languages (Hindi, Bengali, Marathi, Tamil, Telugu). Our model learns a shared semantic embedding space, enabling robust cross-lingual and cross-modal classification. Experiments on two multilingual test sets show our approach outperforms baselines, achieving accuracies of 0.819 and 0.701, and generalizes well to unseen languages. This demonstrates the advantage of combining modalities for hate speech detection in synthetic media, especially in low-resource settings where unimodal models falter. The dataset is available at https://www.iab-rubric.org/resources.
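As a hypothetical illustration of how a shared embedding space supports zero-shot, cross-modal classification: an audio clip can be labeled by comparing its embedding against embeddings of the class descriptions. The function and shapes below are assumptions for illustration; the paper's actual classification head may differ.

```python
# Hypothetical sketch of zero-shot inference in a learned shared space:
# pick the class whose text embedding is closest to the audio embedding.
import torch
import torch.nn.functional as F

def zero_shot_classify(audio_emb, class_text_embs):
    """Return the index of the class with the highest cosine similarity."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    sims = class_text_embs @ audio_emb  # (num_classes,)
    return int(sims.argmax())

# Toy usage: one 256-d audio embedding vs. two class prototypes,
# e.g. text embeddings of "non-hate" and "hate" descriptions.
audio_vec = torch.randn(256)
class_protos = torch.randn(2, 256)
print(zero_shot_classify(audio_vec, class_protos))
```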