PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation

📅 2025-02-27
📈 Citations: 0 (influential: 0)
🤖 AI Summary
High-quality static benchmarks risk data leakage and inflated performance metrics, undermining the sustainability and credibility of LLM evaluation. To address this, we propose PhantomWiki: a dynamic, on-demand evaluation framework that synthesizes fact-consistent documents and question-answer pairs via knowledge-graph-guided controllable text generation, structured QA templates, and multi-granularity difficulty control. PhantomWiki introduces a "just-in-time generation" paradigm that sidesteps dataset contamination; it is the first framework to decouple control of reasoning difficulty from control of retrieval scale; and it operates without reliance on any real-world corpus. Experiments show that PhantomWiki poses substantial challenges to state-of-the-art LLMs, validating it as a scalable, contamination-resistant evaluation benchmark.

📝 Abstract
High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities. Our code is available at https://github.com/kilian-group/phantom-wiki.
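The on-demand generation idea in the abstract can be sketched in a few lines of Python. This is a toy illustration, not the actual PhantomWiki pipeline: the family-graph scheme, function names, and question template below are assumptions made purely for the example. It shows how question difficulty (number of reasoning hops) and corpus size (number of articles) can be varied independently:

```python
import random

def generate_universe(num_people, seed=0):
    """Sample a random parent->child knowledge graph: the ground-truth facts."""
    rng = random.Random(seed)
    people = [f"Person{i}" for i in range(num_people)]
    parent_of = {}  # maps child -> parent
    for i, child in enumerate(people[1:], start=1):
        # Each person's parent is some earlier person, so the graph is acyclic.
        parent_of[child] = people[rng.randrange(i)]
    return people, parent_of

def render_articles(people, parent_of):
    """Render one short 'wiki article' per person from the graph facts."""
    articles = {}
    for p in people:
        facts = [f"{p} is a person."]
        if p in parent_of:
            facts.append(f"The parent of {p} is {parent_of[p]}.")
        articles[p] = " ".join(facts)
    return articles

def make_question(person, parent_of, hops):
    """Build an n-hop ancestor question; difficulty = number of hops."""
    current = person
    for _ in range(hops):
        if current not in parent_of:
            return None  # chain too short for this many hops
        current = parent_of[current]
    chain = " of the parent" * (hops - 1)
    return (f"Who is the parent{chain} of {person}?", current)

people, parent_of = generate_universe(num_people=50, seed=42)
articles = render_articles(people, parent_of)
qa = make_question("Person10", parent_of, hops=2)
```

Scaling `num_people` grows the corpus (stressing retrieval) while increasing `hops` lengthens the reasoning chain, so the two axes can be stressed independently, which is the disentanglement the paper evaluates.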
Problem

Research questions and friction points this paper is trying to address.

Static curated benchmarks are prone to data leakage and inflated performance results
Fixed datasets are not a permanent solution for evaluating LLMs
Reasoning and retrieval capabilities are entangled in existing evaluations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates a unique dataset on demand for each evaluation
Varies question difficulty and corpus size to disentangle reasoning from retrieval
Resists data leakage by never reusing a fixed corpus
Albert Gong
Cornell University
Kamilė Stankevičiūtė
Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
Chao Wan
Cornell University
Machine Learning, Natural Language Processing
Anmol Kabra
Cornell University
Machine Learning
Raphael Thesmar
Cornell University
Deep Learning, NLP, Computer Vision
Johann Lee
Cornell University
Machine Learning
Julius Klenke
Department of Computer Science, Cornell University, Ithaca, New York, USA
Carla P. Gomes
Department of Computer Science, Cornell University, Ithaca, New York, USA
Kilian Q. Weinberger
Department of Computer Science, Cornell University, Ithaca, New York, USA