Are LLMs Good Safety Agents or a Propaganda Engine?

📅 2025-11-28

📈 Citations: 0

✨ Influential: 0

career value

206K/year

🤖 AI Summary

This study investigates whether large language models’ (LLMs) refusals to respond to politically sensitive content stem from safety-aligned safeguards or implicit political censorship. To this end, we introduce PSP—the first political-contextual refusal-behavior probing dataset—systematically covering both explicit and implicit political topics, augmented with cross-national censorship corpora for comparative analysis. Methodologically, we integrate data-driven refusal pattern analysis, representation-space interventions (e.g., concept erasure), prompt injection attacks (PIA), generalization of sensitive content, and masked intent modeling. Our findings reveal that mainstream LLMs exhibit refusal patterns highly consistent with real-world political censorship regimes, demonstrating systematic cross-lingual and cross-regional shifts in refusal behavior. Crucially, we empirically and representationally disentangle safety compliance from political censorship for the first time, identifying key semantic and geo-locational attributes governing refusal decisions.

Technology Category

Application Category

📝 Abstract

Large Language Models (LLMs) are trained to refuse to respond to harmful content. However, systematic analyses of whether this behavior is truly a reflection of its safety policies or an indication of political censorship, that is practiced globally by countries, is lacking. Differentiating between safety influenced refusals or politically motivated censorship is hard and unclear. For this purpose we introduce PSP, a dataset built specifically to probe the refusal behaviors in LLMs from an explicitly political context. PSP is built by formatting existing censored content from two data sources, openly available on the internet: sensitive prompts in China generalized to multiple countries, and tweets that have been censored in various countries. We study: 1) impact of political sensitivity in seven LLMs through data-driven (making PSP implicit) and representation-level approaches (erasing the concept of politics); and, 2) vulnerability of models on PSP through prompt injection attacks (PIAs). Associating censorship with refusals on content with masked implicit intent, we find that most LLMs perform some form of censorship. We conclude with summarizing major attributes that can cause a shift in refusal distributions across models and contexts of different countries.

Problem

Research questions and friction points this paper is trying to address.

Analyzing whether LLM refusal behaviors reflect safety policies or political censorship

Developing a dataset to probe political refusal behaviors across different countries

Investigating model vulnerability through prompt injection attacks on sensitive content

Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed PSP dataset for political refusal analysis

Applied data-driven and representation-level probing approaches

Tested model vulnerability using prompt injection attacks

🔎 Similar Papers

No similar papers found.