🤖 AI Summary
Static benchmarks, however high-quality, risk data leakage and inflated performance metrics, undermining the credibility and sustainability of LLM evaluation. To address this, PhantomWiki is proposed: a dynamic, on-demand evaluation framework that synthesizes factually consistent document corpora and question-answer pairs via knowledge-graph-guided text generation, templated QA construction, and fine-grained difficulty control. By generating a fresh instance for each evaluation, PhantomWiki avoids dataset contamination; it decouples reasoning difficulty (question complexity) from retrieval difficulty (corpus size); and it does not depend on any real-world corpus. Experiments show that PhantomWiki instances remain surprisingly challenging for state-of-the-art LLMs, supporting its use as a scalable, contamination-resistant evaluation benchmark.
📝 Abstract
High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating static datasets for this purpose is not a permanent solution, as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities, respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities. Our code is available at https://github.com/kilian-group/phantom-wiki.
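To make the on-demand idea concrete, here is a minimal, purely illustrative sketch of the paradigm: seed a random "universe" of facts, render documents consistent with those facts, and derive templated question-answer pairs from the same facts. All names and structure here (`generate_instance`, the friendship relation, the templates) are hypothetical assumptions for illustration, not the actual PhantomWiki API or pipeline.

```python
import random

def generate_instance(seed, num_people=5):
    """Hypothetical sketch of just-in-time benchmark generation:
    a fresh knowledge graph, corpus, and QA set per evaluation run."""
    rng = random.Random(seed)
    people = [f"person_{i}" for i in range(num_people)]
    # Random relational facts form a tiny knowledge graph.
    facts = {p: rng.sample([q for q in people if q != p], k=2) for p in people}
    # One short document per entity, kept consistent with the facts.
    corpus = {p: f"{p} is friends with {' and '.join(facts[p])}." for p in people}
    # Templated QA pairs derived from the same facts, so answers are exact.
    qa_pairs = [(f"Who are the friends of {p}?", sorted(facts[p])) for p in people]
    return corpus, qa_pairs

# A never-before-seen instance for this evaluation; change the seed
# (or draw it at eval time) and no model can have memorized the answers.
corpus, qa = generate_instance(seed=42)
```

In this toy version, corpus size (`num_people`) and question difficulty (e.g., chaining "friend of a friend" templates) could be varied independently, which is the disentanglement the abstract describes.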