Don't Erase, Inform! Detecting and Contextualizing Harmful Language in Cultural Heritage Collections

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Historical biases and offensive terminology in cultural heritage metadata pose a dilemma: outright removal compromises archival integrity, while retention perpetuates discrimination. This study proposes a “preserve-and-recontextualize” paradigm, developing a multilingual AI toolchain co-designed with marginalized communities to curate sensitive terminology. The pipeline combines rule-based matching, fine-tuned BERT models, and prompt-engineered large language models (LLMs) to automatically detect contested terms and generate two-layered interpretations that situate each term both in its historical context and in contemporary perception. The system is deployed as a web application and integrated with cultural heritage (CH) platform APIs. Applied to 7.9 million metadata records, it has identified and recontextualized thousands of problematic terms. Adopted by multiple international institutions, the framework advances ethical metadata governance and enhances the inclusivity, accessibility, and historical reflexivity of collections.
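The rule-based detection step described above can be sketched as a whole-word vocabulary match over metadata text, followed by construction of a prompt for the LLM contextualization stage. This is a minimal illustration, not the paper's actual implementation: the names (`VOCABULARY`, `detect_terms`, `build_context_prompt`) and the sample vocabulary entry are assumptions, and the real multilingual vocabulary is community-curated and far larger.

```python
import re

# Hypothetical mini-vocabulary: contested term -> short curatorial note.
# The paper's vocabulary is multilingual and co-created with communities.
VOCABULARY = {
    "primitive": "Often applied in colonial-era descriptions of non-European art; "
                 "now widely considered derogatory in that usage.",
}

def detect_terms(metadata_text, vocabulary):
    """Rule-based pass: find contested terms as whole words, case-insensitively."""
    hits = []
    for term, note in vocabulary.items():
        if re.search(rf"\b{re.escape(term)}\b", metadata_text, re.IGNORECASE):
            hits.append({"term": term, "context_note": note})
    return hits

def build_context_prompt(term, record_text):
    """Assemble a prompt asking an LLM for a two-layered interpretation:
    historical background and contemporary perception of the term."""
    return (
        f"The term '{term}' appears in this cultural heritage record:\n"
        f"{record_text}\n\n"
        "Explain (1) the historical context in which the term was used and "
        "(2) how it is perceived today, without removing it from the record."
    )
```

In a full pipeline, the rule-based hits would be complemented by a fine-tuned classifier, and the prompt output would be attached to the record as contextual metadata rather than replacing the original description.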

📝 Abstract
Cultural Heritage (CH) data hold invaluable knowledge, reflecting the history, traditions, and identities of societies, and shaping our understanding of the past and present. However, many CH collections contain outdated or offensive descriptions that reflect historical biases. CH Institutions (CHIs) face significant challenges in curating these data due to the vast scale and complexity of the task. To address this, we develop an AI-powered tool that detects offensive terms in CH metadata and provides contextual insights into their historical background and contemporary perception. We leverage a multilingual vocabulary co-created with marginalized communities, researchers, and CH professionals, along with traditional NLP techniques and Large Language Models (LLMs). Available as a standalone web app and integrated with major CH platforms, the tool has processed over 7.9 million records, contextualizing the contentious terms detected in their metadata. Rather than erasing these terms, our approach seeks to inform, making biases visible and providing actionable insights for creating more inclusive and accessible CH collections.
Problem

Research questions and friction points this paper is trying to address.

Detecting harmful language in cultural heritage metadata
Providing historical context for offensive terms
Supporting inclusive curation of heritage collections
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI tool detects offensive terms in metadata
Multilingual vocabulary co-created with communities
Integrates with major CH platforms
O. M. Mastromichalakis
National Technical University of Athens
Jason Liartis
PhD Student, NTUA
XAI · Interpretable Machine Learning · KRR
Kristina Rose
DFF - Deutsches Filminstitut & Filmmuseum
Antoine Isaac
Europeana & VU University Amsterdam
Linked Data · Cultural Heritage · Semantic Web
G. Stamou
National Technical University of Athens