🤖 AI Summary
This work addresses the challenges of identifying explicit and implicit racist discourse in large-scale corpora, and the poor generalization of existing classification methods. We propose the first multi-stage framework integrating sociolinguistic theory with machine learning, featuring historical and spatiotemporal contextualization, cross-lingual supervised classification, and a multi-level racial discourse ontology for fine-grained detection and categorization. By tightly integrating conceptual and contextual modeling of racism with XLM-RoBERTa, we obtain a theoretically grounded yet computationally tractable classification system. We release XLM-R-Racismo, a domain-specific pretrained model. Evaluated on tweets relating to the Ecuadorian Indigenous community (2018–2021), our approach achieves a new state-of-the-art F1 score, improving on prior methods by 8.2 percentage points and significantly enhancing detection of implicit racism.
📝 Abstract
Current methods for identifying and classifying racist language in text rely on small-n qualitative approaches or on large-n approaches that focus exclusively on overt forms of racist discourse. This article provides a generalizable, step-by-step guideline for identifying and classifying different forms of racist discourse in large corpora. In our approach, we start by conceptualizing racism and its different manifestations. We then contextualize these racist manifestations to the time and place of interest, which allows researchers to identify their discursive form. Finally, we apply XLM-RoBERTa (XLM-R), a cross-lingual model for supervised text classification with cutting-edge contextual understanding of text. We show that XLM-R and XLM-R-Racismo, our pretrained model, outperform other state-of-the-art approaches in classifying racism in large corpora. We illustrate our approach using a corpus of tweets relating to the Ecuadorian indígena community between 2018 and 2021.