LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News

📅 2026-02-14
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the lack of effective benchmarks for evaluating large language models (LLMs) on agentic web search tasks that require real-time information retrieval and multi-hop reasoning. To this end, we propose LiveNewsBench, the first dynamic news evaluation benchmark that supports high-frequency updates and explicitly targets multi-hop search and reasoning. The benchmark automatically constructs question-answer pairs from live news streams, cleanly disentangling a model's internal knowledge from its external search capability, and incorporates human verification to ensure reliability. Additionally, we curate a large-scale training dataset to alleviate the scarcity of agentic search training data, and we publicly release the benchmark, dataset, and leaderboard. We conduct a systematic empirical evaluation of mainstream commercial and open-weight LLMs as well as LLM-based web search APIs, providing the community with a robust and reliable evaluation framework.
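
The knowledge-versus-search separation described above can be pictured as a closed-book filter: if a model already answers a question correctly without any search tool, that question cannot measure search ability. Below is a minimal sketch of this idea; the grader, the `ask_closed_book` callable, and the example data are hypothetical stand-ins, not the paper's actual pipeline (which instead relies on news published after model training cutoffs).

```python
from typing import Callable

def grade(predicted: str, gold: str) -> bool:
    """Toy grader: case-insensitive exact match (an LLM judge is more typical)."""
    return predicted.strip().lower() == gold.strip().lower()

def filter_search_dependent(
    qa_pairs: list[tuple[str, str]],
    ask_closed_book: Callable[[str], str],  # question -> answer, no search tools
) -> list[tuple[str, str]]:
    """Keep only QA pairs the model fails without search access."""
    return [(q, a) for q, a in qa_pairs if not grade(ask_closed_book(q), a)]

# Toy usage: a model whose training data predates the news being asked about.
stale_model = lambda q: "I don't know"
pairs = [("Which company announced the merger on 2026-02-10?", "Acme Corp")]
print(filter_search_dependent(pairs, stale_model))
# -> the pair is kept, since the closed-book model cannot answer it
```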

📝 Abstract
Large Language Models (LLMs) with agentic web search capabilities show strong potential for tasks requiring real-time information access and complex fact retrieval, yet evaluating such systems remains challenging. We introduce LiveNewsBench, a rigorous and regularly updated benchmark designed to assess the agentic web search abilities of LLMs. LiveNewsBench automatically generates fresh question-answer pairs from recent news articles, ensuring that questions require information beyond an LLM's training data and enabling clear separation between internal knowledge and search capability. The benchmark features intentionally difficult questions requiring multi-hop search queries, page visits, and reasoning, making it well-suited for evaluating agentic search behavior. Our automated data curation and question generation pipeline enables frequent benchmark updates and supports construction of a large-scale training dataset for agentic web search models, addressing the scarcity of such data in the research community. To ensure reliable evaluation, we include a subset of human-verified samples in the test set. We evaluate a broad range of systems using LiveNewsBench, including commercial and open-weight LLMs as well as LLM-based web search APIs. The leaderboard, datasets, and code are publicly available at livenewsbench.com.
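
To make the evaluation target concrete, the following is a generic agentic search loop of the kind the abstract describes: the model alternates between issuing search queries, visiting pages, and reasoning over observations until it commits to an answer. This is a sketch under assumptions, not LiveNewsBench's harness; the `policy` and `tools` callables stand in for an LLM and real search/browse backends.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    action: str       # "search", "visit", or "answer"
    argument: str     # query string, URL, or the final answer text
    observation: str = ""

@dataclass
class Trace:
    question: str
    steps: list[Step] = field(default_factory=list)

def run_agent(
    question: str,
    policy: Callable[[Trace], Step],         # LLM call: trace so far -> next step
    tools: dict[str, Callable[[str], str]],  # e.g. {"search": ..., "visit": ...}
    max_hops: int = 8,
) -> Trace:
    """Roll out one episode: act, observe, and repeat until the agent answers."""
    trace = Trace(question)
    for _ in range(max_hops):
        step = policy(trace)
        if step.action == "answer":          # terminal step: no tool call needed
            trace.steps.append(step)
            break
        step.observation = tools[step.action](step.argument)
        trace.steps.append(step)
    return trace
```

Scoring the final "answer" step against the gold answer, and logging how many hops each system needed, would then suffice to compare closed-book models, tool-using LLMs, and web search APIs on identical questions.
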
Problem

Research questions and friction points this paper is trying to address.

LLM evaluation
agentic web search
real-time information
benchmark
fact retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic web search
live news benchmark
multi-hop reasoning
automated data curation
LLM evaluation