AI Summary
This study investigates whether proprietary large language models (LLMs) implicitly sanitize sensitive content, even without explicit instructions or fine-tuning, revealing potential intrinsic value alignment.
Method: Using GPT-4o-mini as a testbed, we conduct zero-shot sensitivity classification and paraphrasing experiments to quantify the degree to which the model attenuates derogatory and taboo language during rewriting. A minimal sketch of this setup is given after this paragraph.
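To make the setup concrete, the following is a minimal sketch of how such an experiment could be run against the OpenAI chat API. The prompt wording, the label set, and the helper names (`classify_sensitivity`, `paraphrase`) are illustrative assumptions of ours, not the study's actual protocol.

```python
# Illustrative sketch only: prompts, labels, and function names are assumptions,
# not the paper's actual experimental protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["non-sensitive", "derogatory", "taboo"]  # hypothetical label set

def classify_sensitivity(sentence: str) -> str:
    """Zero-shot sensitivity classification: ask the model for a single label."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": f"Classify the sentence into one of: {', '.join(LABELS)}. "
                        "Reply with the label only."},
            {"role": "user", "content": sentence},
        ],
    )
    return resp.choices[0].message.content.strip().lower()

def paraphrase(sentence: str) -> str:
    """Plain paraphrase request, deliberately with no instruction to moderate."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user",
                   "content": f"Paraphrase the following sentence:\n{sentence}"}],
    )
    return resp.choices[0].message.content.strip()

# Sensitivity shift induced by rewriting alone:
# label_before = classify_sensitivity(s)
# label_after  = classify_sensitivity(paraphrase(s))
```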
Contribution/Results: We provide the first empirical evidence that LLMs exhibit inherent content-moderation tendencies despite lacking domain-specific sensitive-content training: rewritten outputs show statistically significant reductions in sensitivity scores and marked decreases in derogatory/taboo lexical usage. Moreover, the model's zero-shot sensitivity classification accuracy surpasses that of conventional baseline models. These findings uncover latent value-alignment mechanisms in black-box LLMs, offering novel insights into implicit safety boundaries and the embeddedness of normative values in generative AI systems.
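As one way to substantiate "statistically significant reductions", a paired nonparametric test on per-sentence sensitivity scores before and after paraphrasing could look like the sketch below. The choice of the Wilcoxon signed-rank test and the sample scores are our assumptions; the summary does not name the test actually used.

```python
# Assumed analysis: the summary reports significant reductions but does not name
# the test; a paired Wilcoxon signed-rank test is one natural choice.
from scipy.stats import wilcoxon

# Hypothetical per-sentence sensitivity scores (higher = more sensitive).
scores_original   = [0.91, 0.74, 0.88, 0.65, 0.79, 0.93, 0.70, 0.84]
scores_paraphrase = [0.32, 0.41, 0.25, 0.38, 0.30, 0.45, 0.29, 0.36]

# One-sided test: does paraphrasing lower the sensitivity score?
stat, p_value = wilcoxon(scores_original, scores_paraphrase, alternative="greater")
print(f"W={stat}, p={p_value:.4f}")  # p < 0.05 would support a significant reduction
```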
Abstract
Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, little work has examined whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of the resulting sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. We also evaluate the zero-shot capability of LLMs to classify sentence sensitivity, comparing their performance against traditional methods, as in the sketch below.
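For the zero-shot-versus-traditional comparison, accuracy and macro-F1 on a shared labeled test set would be a standard way to compare the two; the baseline choice here (a TF-IDF + logistic-regression classifier) and the toy data are illustrative assumptions, since the summary does not specify the baselines used.

```python
# Illustrative comparison of zero-shot LLM labels against a traditional baseline.
# The baseline (TF-IDF + logistic regression) and the toy data are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline

def evaluate(y_true, y_pred, name):
    print(f"{name}: acc={accuracy_score(y_true, y_pred):.3f}, "
          f"macro-F1={f1_score(y_true, y_pred, average='macro'):.3f}")

# Hypothetical labeled data for demonstration only.
train_texts  = ["have a nice day", "you are an idiot",
                "lovely weather today", "shut up, you moron"]
train_labels = ["non-sensitive", "derogatory", "non-sensitive", "derogatory"]
test_texts   = ["what a fool", "good morning"]
test_labels  = ["derogatory", "non-sensitive"]

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(train_texts, train_labels)
evaluate(test_labels, baseline.predict(test_texts), "TF-IDF+LR baseline")

# llm_preds = [classify_sensitivity(t) for t in test_texts]  # from the earlier sketch
# evaluate(test_labels, llm_preds, "GPT-4o-mini zero-shot")
```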