Evaluating Commercial AI Chatbots as News Intermediaries

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study presents the first systematic evaluation of six commercial AI chatbots as news intermediaries, assessing their factual accuracy and fairness across languages and regions. Drawing on 14 days of same-day BBC News coverage from six regional services, the authors constructed 2,100 time-sensitive factual questions in multiple-choice and free-response formats, complemented by error attribution, citation tracing, and adversarial questioning. The analysis identifies three core issues: an Anglophone retrieval bias, linguistic inequity, and vulnerability to false premises. The best systems exceed 90% accuracy on multiple-choice questions but lose 11–13% under free-response evaluation, and the drop across the full cohort is 16–17%. Every model records its lowest accuracy on Hindi (79% vs. 89–91% elsewhere). Across all errors, more than 70% stem from retrieval failure rather than reasoning failure. When questions contain subtle false premises, accuracy falls from 88–96% to 19–70%, and the most vulnerable model accepts fabricated facts 64% of the time. The results also indicate that detecting a false premise and recovering a correct answer are partially independent capabilities.

📝 Abstract

AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5, and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier. The same systems, however, lose 11-13% under free-response evaluation, and the drop across the full cohort is 16-17%. We further characterize three failure patterns. First, every model achieves its lowest accuracy on Hindi (79% vs. 89-91% elsewhere), and citations indicate an Anglophone retrieval bias (e.g., models answering Hindi queries cite English Wikipedia more than any Hindi outlet). Second, retrieval failures, not reasoning failures, account for over 70% of all errors. When models retrieve a correct source, they often extract the correct answer; the difficulty is landing on the right source in the first place. Third, models achieving 88-96% accuracy on well-formed questions drop to 19-70% when questions contain subtle false premises, with the most vulnerable model accepting fabricated facts 64% of the time. We also identify a detection-accuracy paradox: the best false-premise detector ranks second in adversarial accuracy (abstention rate), while a weaker detector ranks first, showing that premise detection and answer recovery are partially independent capabilities. Overall, these findings suggest that high accuracy can mask systematic regional inequity, near-total dependence on retrieval infrastructure, and vulnerability to the imperfect queries that real users pose.
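
The two headline metrics in the abstract, per-language accuracy and the share of errors attributed to retrieval, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Response` record, its field names, and the assumption that each graded answer carries a `retrieved_correct_source` flag are all hypothetical and introduced only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Response:
    """One graded chatbot answer (hypothetical schema, not from the paper)."""
    language: str                   # e.g. "Hindi", "Turkish"
    correct: bool                   # did the answer match the reference?
    retrieved_correct_source: bool  # assumed label: was a correct source cited?


def accuracy_by_language(responses):
    """Return the fraction of correct answers for each language."""
    totals = Counter()
    hits = Counter()
    for r in responses:
        totals[r.language] += 1
        hits[r.language] += r.correct  # True counts as 1, False as 0
    return {lang: hits[lang] / totals[lang] for lang in totals}


def retrieval_error_share(responses):
    """Return the fraction of wrong answers where no correct source was retrieved."""
    errors = [r for r in responses if not r.correct]
    if not errors:
        return 0.0
    retrieval_failures = sum(1 for r in errors if not r.retrieved_correct_source)
    return retrieval_failures / len(errors)
```

Under this toy definition, an error counts as a retrieval failure when the model never reached a correct source, and as a reasoning failure otherwise. The paper's actual attribution procedure may differ.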

Problem

Research questions and friction points this paper is trying to address.

AI chatbots

news intermediaries

factual accuracy

multilingual evaluation

retrieval bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

retrieval bias

multilingual news evaluation

false premise vulnerability