🤖 AI Summary
Large language models (LLMs) apply opaque, frequently updated content moderation policies that implicitly shape public discourse, yet no systematic mechanism exists to monitor them. To address this gap, we propose AI Watchman, a longitudinal auditing framework designed to analyze LLM content refusal behavior. It uses a dataset spanning over 400 sociopolitical topics to conduct automated, cross-temporal, and cross-model audits of leading models, including OpenAI's GPT-4.1 and GPT-5 and DeepSeek (in both English and Chinese), complemented by qualitative analysis. AI Watchman makes three contributions: (1) detection of unannounced shifts in content moderation policy; (2) quantitative measurement of moderation differences across companies and models; and (3) a systematic taxonomy of refusal types. Empirical evaluation shows that the system can expose otherwise opaque moderation behavior and support regulatory assessment and public oversight, establishing a reproducible foundation for longitudinal LLM transparency research.
📝 Abstract
Large language models' (LLMs') outputs are shaped by opaque and frequently changing company content moderation policies and practices. LLM moderation often takes the form of refusal; models' refusal to produce text about certain topics both reflects company policy and subtly shapes public discourse. We introduce AI Watchman, a longitudinal auditing system that publicly measures and tracks LLM refusals over time, providing transparency into an important but black-box aspect of LLMs. Using a dataset of over 400 social issues, we audit OpenAI's moderation endpoint, GPT-4.1, GPT-5, and DeepSeek (in both English and Chinese). We find that changes in company policies, even those not publicly announced, can be detected by AI Watchman, and we identify company- and model-specific differences in content moderation. We also qualitatively analyze and categorize different forms of refusal. This work contributes evidence for the value of longitudinal auditing of LLMs, and AI Watchman as one system for doing so.