Bridging Gaps in Hate Speech Detection: Meta-Collections and Benchmarks for Low-Resource Iberian Languages

📅 2025-10-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address data scarcity, overlooked linguistic varieties, and the lack of benchmarks in hate speech detection for low-resource Iberian languages, this work compiles a meta-collection of existing European Spanish hate speech datasets, standardised with unified labels and metadata. The collection is translated into European Portuguese and two Galician variants (one more convergent with Spanish, one more convergent with Portuguese), yielding aligned multilingual corpora. On these resources, the authors establish new benchmarks for Iberian languages, evaluating state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings and running a cross-lingual analysis. The findings underscore the importance of multilingual, variety-aware approaches to hate speech detection.

📝 Abstract

Hate speech poses a serious threat to social cohesion and individual well-being, particularly on social media, where it spreads rapidly. While research on hate speech detection has progressed, it remains largely focused on English, resulting in limited resources and benchmarks for low-resource languages. Moreover, many of these languages have multiple linguistic varieties, a factor often overlooked in current approaches. At the same time, large language models require substantial amounts of data to perform reliably, a requirement that low-resource languages often cannot meet. In this work, we address these gaps by compiling a meta-collection of hate speech datasets for European Spanish, standardised with unified labels and metadata. This collection is based on a systematic analysis and integration of existing resources, aiming to bridge the data gap and support more consistent and scalable hate speech detection. We extend this collection by translating it into European Portuguese and into two Galician variants, one more convergent with Spanish and one more convergent with Portuguese, creating aligned multilingual corpora. Using these resources, we establish new benchmarks for hate speech detection in Iberian languages. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, providing baseline results for future research. Moreover, we perform a cross-lingual analysis with our target languages. Our findings underscore the importance of multilingual and variety-aware approaches in hate speech detection and offer a foundation for improved benchmarking in underrepresented European languages.
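
As a rough illustration of what "standardised with unified labels and metadata" could look like in practice, the sketch below maps two source datasets onto one binary label scheme and merges them with simple text-level deduplication. The dataset names, label vocabularies, record fields, and deduplication rule are all hypothetical; the summary does not describe the actual schema or procedure used in the paper.

```python
from dataclasses import dataclass

# Hypothetical label vocabularies for two source datasets (assumption:
# the real datasets and their label names are not listed in this summary).
LABEL_MAPS = {
    "dataset_a": {"hateful": 1, "non-hateful": 0},
    "dataset_b": {"HS": 1, "NOT_HS": 0},
}


@dataclass(frozen=True)
class Record:
    text: str
    label: int      # unified scheme: 1 = hate speech, 0 = not hate speech
    source: str     # which original dataset the record came from
    language: str   # language/variety tag, e.g. "es"


def unify(rows, source, language="es"):
    """Convert (text, original_label) pairs into unified Record objects."""
    mapping = LABEL_MAPS[source]
    return [
        Record(text=text.strip(), label=mapping[label],
               source=source, language=language)
        for text, label in rows
    ]


def merge(*collections):
    """Concatenate collections, keeping the first copy of each text
    (case-insensitive match; an illustrative deduplication rule)."""
    seen = set()
    merged = []
    for collection in collections:
        for record in collection:
            key = record.text.casefold()
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged
```

Keeping `source` and `language` on every record means a translated copy of the collection can reuse the same structure with a different variety tag, which is one plausible way to keep the corpora aligned.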

Problem

Research questions and friction points this paper is trying to address.

Addressing hate speech detection gaps in low-resource Iberian languages

Creating standardized, aligned multilingual datasets for European Spanish, European Portuguese, and two Galician variants

Establishing benchmarks and evaluating LLMs for underrepresented language varieties

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compiled meta-collection with unified hate speech labels

Translated datasets into aligned multilingual Iberian language corpora

Evaluated large language models across zero-shot, few-shot, and fine-tuning settings
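
To make the zero-shot versus few-shot distinction concrete, here is a minimal sketch of how a classification prompt might be assembled and how predictions could be scored. The prompt wording, the label names `hate` and `not_hate`, and the choice of macro-averaged F1 as the metric are assumptions for illustration; the summary does not state the prompts or metrics the paper used.

```python
def build_prompt(text, examples=()):
    """Build a classification prompt.

    With no examples this is a zero-shot prompt; passing labelled
    (text, label) pairs turns it into a few-shot prompt.
    """
    parts = ["Classify the message as 'hate' or 'not_hate'."]
    for example_text, example_label in examples:
        parts.append(f"Message: {example_text}\nLabel: {example_label}")
    parts.append(f"Message: {text}\nLabel:")
    return "\n\n".join(parts)


def macro_f1(gold, pred, labels=("hate", "not_hate")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In a cross-lingual setup, the same `build_prompt` call could take few-shot examples in one variety and a test message in another; the model call itself is left out because it depends on the model and API in use.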

🔎 Similar Papers

From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets