🤖 AI Summary
Existing LLM benchmarks predominantly rely on static text evaluation, failing to adequately assess models’ capabilities in perceiving, retaining, and structurally generating real-time information. This work introduces the first dynamic benchmark tailored for real-time report generation, encompassing two realistic settings: document-free generation and external-document-augmented generation. Methodologically, we propose a novel dual-path retrieval mechanism—integrating web search with a curated local report repository—coupled with a domain-knowledge-constrained information synthesis paradigm and a timeliness-aware evaluation protocol. We further develop a domain-adaptive report generation system grounded in this framework. Experimental results demonstrate that our approach achieves state-of-the-art performance, outperforming GPT-4o by 7.0% and 5.8% in the two respective settings. To foster reproducibility and community advancement, we will publicly release both the codebase and the benchmark dataset.
📝 Abstract
Traditional benchmarks for large language models (LLMs) typically rely on static evaluations through storytelling or opinion expression, which fail to capture the dynamic requirements of real-time information processing in contemporary applications. To address this limitation, we present DynamicBench, a benchmark designed to evaluate the proficiency of LLMs in storing and processing up-to-the-minute data. DynamicBench utilizes a dual-path retrieval pipeline, integrating web searches with local report databases, and requires domain-specific knowledge to ensure accurate report generation within specialized fields. By evaluating models in scenarios that either provide or withhold external documents, DynamicBench effectively measures their capability to independently process recent information or leverage contextual enhancements. Additionally, we introduce an advanced report generation system adept at managing dynamic information synthesis. Our experimental results confirm the efficacy of our approach, achieving state-of-the-art performance and surpassing GPT-4o in document-free and document-assisted scenarios by 7.0% and 5.8%, respectively. The code and data will be made publicly available.
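The dual-path retrieval idea described above can be sketched minimally: candidates from a web search and from a curated local report repository are merged and de-duplicated before report generation. All names here (`search_web`, `search_local_reports`, `dual_path_retrieve`) are illustrative assumptions, not the paper's actual API; the real system would replace the stubs with live retrieval backends.

```python
# Hedged sketch of dual-path retrieval: combine web-search hits with hits
# from a local report repository, keeping the first copy of each document.
# Function and field names are hypothetical, not from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    source: str  # "web" or "local"
    text: str


def search_web(query: str) -> list[Doc]:
    # Stub standing in for a real web-search call.
    return [Doc("w1", "web", f"web result for {query}")]


def search_local_reports(query: str) -> list[Doc]:
    # Stub standing in for a lookup in a curated local report repository.
    return [Doc("r1", "local", f"local report for {query}")]


def dual_path_retrieve(query: str) -> list[Doc]:
    """Merge both retrieval paths, de-duplicating by doc_id."""
    seen: set[str] = set()
    merged: list[Doc] = []
    for doc in search_web(query) + search_local_reports(query):
        if doc.doc_id not in seen:
            seen.add(doc.doc_id)
            merged.append(doc)
    return merged
```

A downstream generator would then condition on the merged documents (the document-assisted setting) or on the query alone (the document-free setting).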