Few-Shot Contrastive Adaptation for Audio Abuse Detection in Low-Resource Indic Languages

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses abusive speech detection in low-resource Indic languages, where pipelines that run automatic speech recognition (ASR) before text classification suffer from transcription errors and lose prosodic cues. The study presents the first application of the Contrastive Language-Audio Pre-training (CLAP) model to this task and introduces a lightweight few-shot supervised contrastive adaptation method that detects abuse directly from audio across languages. Evaluation on the ADIMA dataset covers cross-lingual and leave-one-language-out settings across ten Indic languages, with zero-shot prompting as an auxiliary analysis, and indicates that CLAP yields strong cross-lingual audio representations. With only a few labeled examples and a projection-only adapter, the method is competitive with fully supervised systems trained on the complete training data. The benefit of few-shot adaptation, however, varies by language and does not increase monotonically with the number of shots.

📝 Abstract

Abusive speech detection is becoming increasingly important as social media shifts towards voice-based interaction, particularly in multilingual and low-resource settings. Most current systems rely on automatic speech recognition (ASR) followed by text-based hate speech classification, but this pipeline is vulnerable to transcription errors and discards prosodic information carried in speech. We investigate whether Contrastive Language-Audio Pre-training (CLAP) can support abusive speech detection directly from audio. Using the ADIMA dataset, we evaluate CLAP-based representations under few-shot supervised contrastive adaptation in cross-lingual and leave-one-language-out settings, with zero-shot prompting included as an auxiliary analysis. Our results show that CLAP yields strong cross-lingual audio representations across ten Indic languages, and that lightweight projection-only adaptation achieves competitive performance with respect to fully supervised systems trained on complete training data. However, the benefits of few-shot adaptation are language-dependent and not monotonic with shot size. These findings suggest that contrastive audio-text models provide a promising basis for cross-lingual audio abuse detection in low-resource settings, while also indicating that transfer remains incomplete and language-specific in important ways.
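The abstract does not spell out the loss used for few-shot supervised contrastive adaptation. As a rough illustration only, the sketch below computes the standard supervised contrastive loss over a small batch of embeddings, assuming these would be the outputs of a trainable projection applied to frozen CLAP audio features. The function names, the temperature value, and the toy vectors are assumptions for illustration, not details from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss over one batch.

    embeddings: list of equal-length float lists (assumed projected audio features)
    labels:     list of class ids, e.g. 1 = abusive, 0 = non-abusive
    temperature: assumed hyperparameter; the paper's value is not given here
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives are the other samples that share the anchor's label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        # Average negative log-probability of the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

In a projection-only setup, only the projection parameters would receive gradients from this loss while the pretrained encoder stays fixed. The loss is lower when same-label clips sit closer together than different-label clips.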

Problem

Research questions and friction points this paper is trying to address.

audio abuse detection

low-resource languages

few-shot learning

cross-lingual adaptation

Indic languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive Language-Audio Pre-training

Few-Shot Adaptation

Cross-lingual Transfer
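
The leave-one-language-out setting named in the abstract can be sketched as a simple split generator: each language is held out once as the test set while the remaining languages supply training data. The tuple layout, clip ids, and language names below are hypothetical placeholders, not the ADIMA schema.

```python
def leave_one_language_out(samples):
    """Yield (held_out_language, train, test) once per language.

    samples: list of (language, clip_id, label) tuples; this layout is an
    assumption made for illustration.
    """
    languages = sorted({lang for lang, _, _ in samples})
    for held_out in languages:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Toy data with hypothetical clip ids and labels (1 = abusive, 0 = not).
toy_samples = [
    ("lang_a", "clip_001", 1),
    ("lang_a", "clip_002", 0),
    ("lang_b", "clip_003", 1),
    ("lang_c", "clip_004", 0),
]

for held_out, train, test in leave_one_language_out(toy_samples):
    print(held_out, len(train), len(test))
```

Few-shot adaptation would then draw a handful of labeled clips per class from the training side of each fold, so the held-out language is never seen during adaptation.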