What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

📅 2025-07-31
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether proprietary large language models (LLMs) implicitly sanitize sensitive content—even without explicit instructions or fine-tuning—revealing potential intrinsic value alignment. Method: Using GPT-4o-mini as a testbed, we conduct zero-shot sensitivity classification and empirical paraphrasing experiments to systematically quantify the degree to which the model attenuates derogatory and taboo language during rewriting. Contribution/Results: We provide the first empirical evidence that LLMs exhibit inherent content moderation tendencies despite lacking domain-specific sensitive-content training: rewritten outputs show statistically significant reductions in sensitivity scores and marked decreases in derogatory/taboo lexical usage. Moreover, the model’s zero-shot sensitivity classification accuracy surpasses that of conventional baseline models. These findings uncover latent value-alignment mechanisms in black-box LLMs, offering novel insights into implicit safety boundaries and the embeddedness of normative values in generative AI systems.
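The core quantity the summary describes is the shift in sensitivity class between original sentences and their LLM paraphrases. A minimal sketch of that evaluation idea, assuming a hypothetical 3-level ordinal scale (0 = neutral, 2 = taboo) and illustrative function names not taken from the paper:

```python
# Hypothetical sketch of the evaluation idea: given sensitivity classes
# (ordinal, 0 = neutral .. 2 = taboo) assigned to sentences before and
# after LLM paraphrasing, quantify how often and how far the model
# moderates content toward less sensitive classes. The scale and all
# names are illustrative assumptions, not the paper's implementation.

def sensitivity_shift(before, after):
    """Return (fraction moderated, mean class shift) for paired scores."""
    if len(before) != len(after) or not before:
        raise ValueError("need equal-length, non-empty score lists")
    shifts = [a - b for b, a in zip(before, after)]
    # A negative shift means the paraphrase landed in a less sensitive class.
    moderated = sum(1 for s in shifts if s < 0) / len(shifts)
    mean_shift = sum(shifts) / len(shifts)
    return moderated, mean_shift

# Example: 4 sentences; 3 paraphrases land in a less sensitive class.
before = [2, 2, 1, 1]   # sensitivity class of the original sentences
after  = [1, 0, 1, 0]   # class of the paraphrased outputs
frac, shift = sensitivity_shift(before, after)
# frac = 0.75, shift = -1.0
```

In the paper's setup, the class labels would come from a sensitivity classifier (or the LLM's own zero-shot judgments) applied to both versions of each sentence.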

📝 Abstract
Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. We also evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performance against traditional methods.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' implicit moderation of sensitive content
Analyzes GPT-4o-mini's paraphrasing behavior for sensitivity shifts
Compares LLMs' zero-shot sensitivity classification with traditional methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirical analysis of implicit content moderation
Evaluates sensitivity shifts in paraphrasing
Zero-shot classification of sentence sensitivity
Alfio Ferrara
Department of Computer Science, Università degli Studi di Milano
data science, natural language processing, digital humanities
Sergio Picascia
Università degli Studi di Milano, Department of Computer Science, Via Celoria, 18 - 20133 Milan, Italy
Laura Pinnavaia
Università degli Studi di Milano, Department of Languages, Literatures, Cultures and Mediations, Piazza S. Alessandro, 1 - 20123 Milan, Italy
Vojimir Ranitovic
Università degli Studi di Milano, Department of Historical Studies, Via Festa del Perdono, 7 - 20126 Milan, Italy
Elisabetta Rocchetti
Università degli Studi di Milano, Department of Computer Science, Via Celoria, 18 - 20133 Milan, Italy
Alice Tuveri
Università degli Studi di Milano, Department of Languages, Literatures, Cultures and Mediations, Piazza S. Alessandro, 1 - 20123 Milan, Italy