Machines Do See Color: A Guideline to Classify Different Forms of Racist Discourse in Large Corpora

📅 2024-01-17
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses the challenges of identifying explicit and implicit racist discourse in large-scale corpora and the poor generalization of existing classification methods. We propose the first multi-stage framework integrating sociolinguistic theory with machine learning, featuring historical-spatiotemporal contextualization, cross-lingual supervised classification, and a multi-level racial discourse ontology for fine-grained detection and categorization. We tightly integrate conceptual and contextual modeling of racism with XLM-RoBERTa, yielding a theoretically grounded yet computationally tractable classification system. We release XLM-R-Racismo, a domain-specific pretrained model. Evaluated on tweets relating to the Ecuadorian indígena community (2018–2021), our approach achieves a new state-of-the-art F1 score, improving upon prior methods by 8.2 percentage points and significantly enhancing detection of implicit racism.
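
The summary reports its results as an F1 score. As a rough illustration of how such a score could be computed for a three-way labeling task, here is a minimal standard-library sketch. The label names (`none`, `explicit`, `implicit`) and the toy predictions are assumptions made for this example; the source does not say which label set or which F1 averaging the authors use.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_f1(
    y_true: Sequence[str], y_pred: Sequence[str], labels: Sequence[str]
) -> Dict[str, float]:
    """Return the F1 score of each label, treating it as the positive class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)


# Hypothetical labels and toy data, for illustration only.
LABELS = ["none", "explicit", "implicit"]
y_true = ["none", "none", "explicit", "implicit", "implicit", "explicit"]
y_pred = ["none", "explicit", "explicit", "implicit", "none", "explicit"]

scores = per_class_f1(y_true, y_pred, LABELS)
print(scores, macro_f1(scores))
```

Reporting the per-class scores alongside the average makes it visible when a classifier does well on explicit cases but misses implicit ones, which is the gap the summary emphasizes.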

📝 Abstract
Current methods to identify and classify racist language in text rely on small-n qualitative approaches or large-n approaches focusing exclusively on overt forms of racist discourse. This article provides a step-by-step generalizable guideline to identify and classify different forms of racist discourse in large corpora. In our approach, we start by conceptualizing racism and its different manifestations. We then contextualize these racist manifestations to the time and place of interest, which allows researchers to identify their discursive form. Finally, we apply XLM-RoBERTa (XLM-R), a cross-lingual model for supervised text classification with a cutting-edge contextual understanding of text. We show that XLM-R and XLM-R-Racismo, our pretrained model, outperform other state-of-the-art approaches in classifying racism in large corpora. We illustrate our approach using a corpus of tweets relating to the Ecuadorian indígena community between 2018 and 2021.
Problem

Research questions and friction points this paper is trying to address.

Classifying diverse racist discourse forms in large text corpora
Developing generalizable guidelines for identifying racist language manifestations
Applying cross-lingual models to improve racism classification accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

XLM-R model for supervised text classification
Pretrained XLM-R-Racismo model for racism detection
Generalizable guideline for classifying racist discourse
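
As a rough sketch of what supervised classification with XLM-R can look like in practice, the snippet below uses the Hugging Face `transformers` library with the public `xlm-roberta-base` checkpoint. This is not the authors' code: the label set is an assumption based on the explicit/implicit distinction in the summary, the `max_length` value is arbitrary, and no checkpoint name for XLM-R-Racismo is given in the source, so none is used here.

```python
from typing import Dict, List

# Hypothetical label set; the source does not list the authors' categories.
LABELS: List[str] = ["none", "explicit", "implicit"]
ID2LABEL: Dict[int, str] = dict(enumerate(LABELS))
LABEL2ID: Dict[str, int] = {name: idx for idx, name in ID2LABEL.items()}


def load_classifier(checkpoint: str = "xlm-roberta-base"):
    """Load a tokenizer and an XLM-R encoder with a new classification head."""
    # Imported lazily so the label mappings above work without the dependency.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model


def predict(texts: List[str], tokenizer, model) -> List[str]:
    """Return the most likely label for each text."""
    import torch

    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=128, return_tensors="pt"
    )
    model.eval()
    with torch.no_grad():
        logits = model(**batch).logits
    return [ID2LABEL[i] for i in logits.argmax(dim=-1).tolist()]


if __name__ == "__main__":
    # Downloads the checkpoint on first use. The classification head is randomly
    # initialized, so predictions are meaningless until the model is fine-tuned
    # on labeled examples.
    tok, clf = load_classifier()
    print(predict(["Ejemplo de tuit.", "An example tweet."], tok, clf))
```

Because XLM-R is cross-lingual, the same tokenizer and encoder handle Spanish and English input; only the classification head needs to be trained on the annotated corpus.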
Diana Davila Gordillo
Lake Forest College
Joan C. Timoneda
Purdue University
Comparative politics, political methodology
Sebastian Vallejo Vera
Western University