Don't Erase, Inform! Detecting and Contextualizing Harmful Language in Cultural Heritage Collections

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Historical biases and offensive terminology in cultural heritage metadata pose a dilemma: outright removal compromises archival integrity, while retention perpetuates discrimination. This study proposes a “preserve-and-recontextualize” paradigm, developing a multilingual AI toolchain co-designed with marginalized communities to curate sensitive terminology. The pipeline combines rule-based matching, fine-tuned BERT models, and prompt-engineered large language models (LLMs) to automatically detect contested terms and generate two-layered interpretations that situate each term both in its historical context and in contemporary perception. The system is deployed as a web application and integrated with cultural heritage (CH) platform APIs. Applied to 7.9 million metadata records, it has identified and recontextualized thousands of problematic terms. Adopted by multiple international institutions, the framework advances ethical metadata governance and enhances the inclusivity, accessibility, and historical reflexivity of collections.
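The rule-based detection step described above can be sketched as a whole-word vocabulary match over metadata text, followed by construction of a prompt for the LLM contextualization stage. This is a minimal illustration, not the paper's actual implementation: the names (`VOCABULARY`, `detect_terms`, `build_context_prompt`) and the sample vocabulary entry are assumptions, and the real multilingual vocabulary is community-curated and far larger.

```python
import re

# Hypothetical mini-vocabulary: contested term -> short curatorial note.
# The paper's vocabulary is multilingual and co-created with communities.
VOCABULARY = {
    "primitive": "Often applied in colonial-era descriptions of non-European art; "
                 "now widely considered derogatory in that usage.",
}

def detect_terms(metadata_text, vocabulary):
    """Rule-based pass: find contested terms as whole words, case-insensitively."""
    hits = []
    for term, note in vocabulary.items():
        if re.search(rf"\b{re.escape(term)}\b", metadata_text, re.IGNORECASE):
            hits.append({"term": term, "context_note": note})
    return hits

def build_context_prompt(term, record_text):
    """Assemble a prompt asking an LLM for a two-layered interpretation:
    historical background and contemporary perception of the term."""
    return (
        f"The term '{term}' appears in this cultural heritage record:\n"
        f"{record_text}\n\n"
        "Explain (1) the historical context in which the term was used and "
        "(2) how it is perceived today, without removing it from the record."
    )
```

In a full pipeline, the rule-based hits would be complemented by a fine-tuned classifier, and the prompt output would be attached to the record as contextual metadata rather than replacing the original description.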

📝 Abstract
Cultural Heritage (CH) data hold invaluable knowledge, reflecting the history, traditions, and identities of societies, and shaping our understanding of the past and present. However, many CH collections contain outdated or offensive descriptions that reflect historical biases. CH Institutions (CHIs) face significant challenges in curating these data due to the vast scale and complexity of the task. To address this, we develop an AI-powered tool that detects offensive terms in CH metadata and provides contextual insights into their historical background and contemporary perception. We leverage a multilingual vocabulary co-created with marginalized communities, researchers, and CH professionals, along with traditional NLP techniques and Large Language Models (LLMs). Available as a standalone web app and integrated with major CH platforms, the tool has processed over 7.9 million records, contextualizing the contentious terms detected in their metadata. Rather than erasing these terms, our approach seeks to inform, making biases visible and providing actionable insights for creating more inclusive and accessible CH collections.
Problem

Research questions and friction points this paper is trying to address.

Detecting harmful language in cultural heritage metadata
Providing historical context for offensive terms
Supporting inclusive curation of heritage collections
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI tool detects offensive terms in metadata
Multilingual vocabulary co-created with communities
Integrates with major CH platforms
O. M. Mastromichalakis
National Technical University of Athens
Jason Liartis
PhD Student, NTUA
XAI · Interpretable Machine Learning · KRR
Kristina Rose
DFF - Deutsches Filminstitut & Filmmuseum
Antoine Isaac
Europeana & VU University Amsterdam
Linked Data · Cultural Heritage · Semantic Web
G. Stamou
National Technical University of Athens