🤖 AI Summary
This study exposes latent ideological bias in large language models (LLMs) arising from the use of politically aligned personas in content moderation. We conduct systematic experiments across multiple LLM architectures and parameter scales, employing fine-grained behavioral analysis and cross-modal (language/vision) consistency evaluation to assess how personas with distinct political orientations influence harmful-content classification. Results demonstrate that models align most strongly with ideologically congruent personas, reinforcing within-ideology coherence while amplifying disagreement across ideological groups, and that persona-conditioned models advocate for their own political position on politically salient tasks. Although overall classification accuracy remains stable, decision tendencies show a systematic ideological skew. This work provides the first empirical evidence of partisan bias in ostensibly neutral LLM-based moderation, revealing a “cloaked neutrality” mechanism, and it offers theoretical insights and methodological tools for developing fair, transparent, and interpretable content moderation frameworks.
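The summary does not specify the exact prompts or models used. As a rough illustration of the persona-conditioning setup it describes, the sketch below injects a politically aligned persona as a system message ahead of a harmful-content classification instruction. The persona texts, label format, and model name (`gpt-4o-mini`) are hypothetical placeholders, not the study's actual configuration.

```python
# Minimal sketch of persona-conditioned harmful-content classification.
# Persona texts, the model name, and the label format are illustrative
# placeholders; they are not taken from the study itself.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONAS = {
    "progressive": "You are a politically progressive content moderator.",
    "conservative": "You are a politically conservative content moderator.",
    "neutral": "You are a content moderator with no political leaning.",
}

INSTRUCTION = (
    "Classify the following post as HARMFUL or NOT_HARMFUL. "
    "Answer with exactly one of these two labels.\n\nPost: {post}"
)

def classify(post: str, persona: str, model: str = "gpt-4o-mini") -> str:
    """Return the model's harmfulness label for `post` under the given persona."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # low-variance decoding so persona effects are comparable
        messages=[
            {"role": "system", "content": PERSONAS[persona]},
            {"role": "user", "content": INSTRUCTION.format(post=post)},
        ],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    post = "Example post text goes here."
    for name in PERSONAS:
        print(name, "->", classify(post, name))
```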
📝 Abstract
Large language models (LLMs) are increasingly used in content moderation systems, where ensuring fairness and neutrality is essential. In this study, we examine how persona adoption influences the consistency and fairness of harmful content classification across different LLM architectures, model sizes, and content modalities (language vs. vision). At first glance, headline performance metrics suggest that personas have little impact on overall classification accuracy. However, a closer analysis reveals important behavioral shifts. Personas with different ideological leanings display distinct propensities to label content as harmful, showing that the lens through which a model "views" its input can subtly shape its judgments. Further agreement analyses show that models, particularly larger ones, tend to align more closely with personas from the same political ideology, strengthening within-ideology consistency while widening divergence across ideological groups. To demonstrate this effect more directly, we conduct an additional study on a politically targeted task, which confirms that personas not only behave more coherently within their own ideology but also tend to defend their own perspective while downplaying the harmfulness of opposing views. Together, these findings highlight how persona conditioning can introduce subtle ideological biases into LLM outputs, raising concerns about the use of AI systems that may reinforce partisan perspectives under the guise of neutrality.
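The abstract describes two behavioral measurements: each persona's propensity to label content as harmful, and pairwise agreement within versus across ideological groups. A minimal sketch of how such an analysis might be computed follows; the binary decision matrix is random placeholder data, and simple percent agreement is an assumed stand-in, since the paper's exact agreement statistic is not given here.

```python
# Sketch of per-persona harmful-labeling propensity and within- vs.
# cross-ideology agreement. The decisions matrix is random placeholder
# data; percent agreement stands in for whatever statistic the paper uses.
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical personas grouped by ideology.
ideology = {"prog_a": "left", "prog_b": "left", "cons_a": "right", "cons_b": "right"}
personas = list(ideology)

# decisions[i, j] = 1 if persona i labeled item j as harmful (placeholder data).
n_items = 500
decisions = rng.integers(0, 2, size=(len(personas), n_items))

# 1) Propensity: how often each persona labels content as harmful.
for name, row in zip(personas, decisions):
    print(f"{name}: harmful rate = {row.mean():.2f}")

# 2) Pairwise percent agreement, split into within- and cross-ideology pairs.
within, across = [], []
for (i, a), (j, b) in itertools.combinations(enumerate(personas), 2):
    agreement = (decisions[i] == decisions[j]).mean()
    (within if ideology[a] == ideology[b] else across).append(agreement)

print(f"within-ideology agreement: {np.mean(within):.2f}")
print(f"cross-ideology agreement:  {np.mean(across):.2f}")
```

Under the paper's findings, the within-ideology average would sit noticeably above the cross-ideology average on real model outputs; with the random placeholder data above, both hover around 0.5.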