🤖 AI Summary
Large language models (LLMs) are increasingly deployed for news source credibility assessment, yet their reliability, consistency, and political biases remain poorly characterized.
Method: We systematically evaluate nine state-of-the-art LLMs using Spearman correlation analysis, cross-model consistency checks, expert annotation benchmarking, and controlled prompt-based bias induction experiments.
Contribution/Results: We identify, for the first time, a pervasive liberal bias across LLMs' credibility judgments; assigning partisan roles via prompting induces strong politically congruent bias; and larger models counterintuitively refuse to provide ratings more often, while smaller models make more rating errors. LLMs show high inter-model agreement (ρ = 0.79) but only moderate alignment with human experts (ρ = 0.50). These findings reveal systematic, structurally embedded biases in LLM-based credibility evaluation, providing critical empirical evidence for designing trustworthy AI systems and media literacy tools.
📄 Abstract
Search engines increasingly leverage large language models (LLMs) to generate direct answers, and AI chatbots now access the Internet for fresh data. As information curators for billions of users, LLMs must assess the accuracy and reliability of different sources. This paper audits nine widely used LLMs from three leading providers (OpenAI, Google, and Meta) to evaluate their ability to discern credible and high-quality information sources from low-credibility ones. We find that while LLMs can rate most tested news outlets, larger models more frequently refuse to provide ratings due to insufficient information, whereas smaller models are more prone to making errors in their ratings. For sources where ratings are provided, LLMs exhibit a high level of agreement among themselves (average Spearman's $\rho = 0.79$), but their ratings align only moderately with human expert evaluations (average $\rho = 0.50$). Analyzing news sources with different political leanings in the US, we observe a liberal bias in the credibility ratings yielded by all LLMs in their default configurations. Additionally, assigning partisan roles to LLMs consistently induces strong politically congruent bias in their ratings. These findings have important implications for the use of LLMs in curating news and political information.
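The inter-model and model-vs-expert comparisons above rest on Spearman's rank correlation, which measures how well one set of credibility ratings preserves the ordering of another. A minimal sketch of that computation is below; the ratings are hypothetical placeholders (the paper's actual LLM and expert scores are not reproduced here), and the correlation values they yield are illustrative only, not the paper's reported $\rho = 0.79$ and $\rho = 0.50$.

```python
def rank(values):
    """Assign 1-based ranks, averaging ranks across ties (standard for Spearman)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical credibility ratings for five outlets (0-10 scale).
llm_a  = [8, 6, 9, 3, 5]
llm_b  = [7, 5, 9, 2, 6]
expert = [9, 4, 8, 5, 3]

print(f"LLM-LLM agreement:    {spearman(llm_a, llm_b):.2f}")   # high
print(f"LLM-expert agreement: {spearman(llm_a, expert):.2f}")  # lower
```

In practice one would compute this pairwise across all nine models and average the coefficients, as the paper's "average Spearman's $\rho$" phrasing implies; `scipy.stats.spearmanr` performs the same calculation.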