🤖 AI Summary
This work addresses the challenges of identifying explicit and implicit racist discourse in large-scale corpora, and the poor generalization of existing classification methods. We propose the first multi-stage framework integrating sociolinguistic theory with machine learning, featuring historical and spatiotemporal contextualization, cross-lingual supervised classification, and a multi-level racial discourse ontology for fine-grained detection and categorization. By tightly integrating conceptual and contextual modeling of racism with XLM-RoBERTa, we obtain a theoretically grounded yet computationally tractable classification system. We release XLM-R-Racismo, a domain-specific pretrained model. Evaluated on tweets relating to the Ecuadorian Indigenous community (2018–2021), our approach achieves a new state-of-the-art F1 score, improving on prior methods by 8.2 percentage points and significantly enhancing detection of implicit racism.
📝 Abstract
Current methods for identifying and classifying racist language in text rely on small-n qualitative approaches or on large-n approaches that focus exclusively on overt forms of racist discourse. This article provides a generalizable, step-by-step guideline for identifying and classifying different forms of racist discourse in large corpora. In our approach, we start by conceptualizing racism and its different manifestations. We then contextualize these racist manifestations to the time and place of interest, which allows researchers to identify their discursive form. Finally, we apply XLM-RoBERTa (XLM-R), a cross-lingual model for supervised text classification with cutting-edge contextual understanding of text. We show that XLM-R and XLM-R-Racismo, our pretrained model, outperform other state-of-the-art approaches in classifying racism in large corpora. We illustrate our approach using a corpus of tweets relating to the Ecuadorian indígena community between 2018 and 2021.