SourceBench: Can AI Answers Reference Quality Web Sources?

📅 2026-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical gap in current AI evaluation practices, which predominantly focus on answer correctness while neglecting the quality of cited sources. To bridge this gap, the authors introduce SourceBench, the first multidimensional benchmark specifically designed to evaluate the web citations generated by AI systems, assessing eight dimensions including relevance, accuracy, objectivity, timeliness, and authority. Using 100 real-world queries and 3,996 cited sources, the study combines human annotations with a calibrated LLM-based evaluator to systematically assess eight large language models, Google Search, and three AI-powered search tools. The findings reveal significant disparities between generative AI and traditional search in evidence-citation practices and yield four core insights to guide the future integration of generative AI with search technologies.

📝 Abstract
Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that matches expert judgments closely. We evaluate eight LLMs, Google Search, and three AI search tools over 3,996 cited sources using SourceBench and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research at the intersection of GenAI and web search.
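The abstract notes that the LLM-based evaluator is calibrated to match expert judgments. The paper does not specify how agreement is measured; as an illustration only, rater agreement of this kind is commonly reported with chance-corrected statistics such as Cohen's kappa. A minimal sketch with hypothetical 1–5 source-quality ratings:

```python
from collections import Counter

def cohens_kappa(human, llm):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(human) == len(llm) and human
    n = len(human)
    # Observed agreement: fraction of items where the two raters coincide.
    observed = sum(h == l for h, l in zip(human, llm)) / n
    # Expected agreement if each rater drew labels independently
    # from their own marginal label distribution.
    hc, lc = Counter(human), Counter(llm)
    expected = sum((hc[x] / n) * (lc[x] / n) for x in set(hc) | set(lc))
    return (observed - expected) / (1 - expected)

# Hypothetical 1-5 quality ratings for a handful of cited sources
# (illustrative values, not data from the paper).
human_scores = [5, 4, 4, 2, 3, 5, 1, 4]
llm_scores   = [5, 4, 3, 2, 3, 5, 2, 4]
print(round(cohens_kappa(human_scores, llm_scores), 3))  # → 0.68
```

A kappa near 1 indicates the automated evaluator tracks expert labels well beyond chance, which is the property a calibrated evaluator needs before replacing human annotation at scale.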
Problem

Research questions and friction points this paper is trying to address.

source quality
large language models
web citations
evidence evaluation
information reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

SourceBench
source quality evaluation
large language models
evidence quality
AI search
Hexi Jin
University of California, San Diego
Stephen Liu
University of California, San Diego
Yuheng Li
University of California, San Diego
Simran Malik
University of California, San Diego
Yiying Zhang
University of California, San Diego
Systems+ML, Operating Systems, Data Center, Cloud Computing