🤖 AI Summary
Static benchmarks, however high-quality, risk data leakage and inflated performance metrics, undermining the credibility and sustainability of LLM evaluation. To address this, PhantomWiki is proposed: a dynamic, on-demand evaluation framework that synthesizes factually consistent document corpora and question-answer pairs via knowledge-graph-guided text generation, templated QA construction, and fine-grained difficulty control. By generating a fresh instance for each evaluation, PhantomWiki avoids dataset contamination; it decouples reasoning difficulty (question complexity) from retrieval difficulty (corpus size); and it does not depend on any real-world corpus. Experiments show that PhantomWiki instances remain surprisingly challenging for state-of-the-art LLMs, supporting its use as a scalable, contamination-resistant evaluation benchmark.
📝 Abstract
High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating static datasets for this purpose is not a permanent solution, as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities, respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities. Our code is available at https://github.com/kilian-group/phantom-wiki.
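To make the on-demand idea concrete, here is a minimal, purely illustrative sketch of the paradigm: seed a random "universe" of facts, render documents consistent with those facts, and derive templated question-answer pairs from the same facts. All names and structure here (`generate_instance`, the friendship relation, the templates) are hypothetical assumptions for illustration, not the actual PhantomWiki API or pipeline.

```python
import random

def generate_instance(seed, num_people=5):
    """Hypothetical sketch of just-in-time benchmark generation:
    a fresh knowledge graph, corpus, and QA set per evaluation run."""
    rng = random.Random(seed)
    people = [f"person_{i}" for i in range(num_people)]
    # Random relational facts form a tiny knowledge graph.
    facts = {p: rng.sample([q for q in people if q != p], k=2) for p in people}
    # One short document per entity, kept consistent with the facts.
    corpus = {p: f"{p} is friends with {' and '.join(facts[p])}." for p in people}
    # Templated QA pairs derived from the same facts, so answers are exact.
    qa_pairs = [(f"Who are the friends of {p}?", sorted(facts[p])) for p in people]
    return corpus, qa_pairs

# A never-before-seen instance for this evaluation; change the seed
# (or draw it at eval time) and no model can have memorized the answers.
corpus, qa = generate_instance(seed=42)
```

In this toy version, corpus size (`num_people`) and question difficulty (e.g., chaining "friend of a friend" templates) could be varied independently, which is the disentanglement the abstract describes.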