NeoQA: Evidence-based Question Answering with Generated News Events

📅 2025-05-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of distinguishing "evidence-based reasoning" from "knowledge recall" in large language models (LLMs) for evidence-based question answering. To this end, the authors introduce NeoQA, a benchmark offering a fully controllable, RAG-oriented evaluation paradigm grounded in synthetically generated fictional news events. NeoQA combines event-graph-driven news generation, structured knowledge base construction, and adversarial question-answer pair design to ensure zero training-data contamination and full traceability of evidence sources, while supporting fine-grained reasoning scenarios such as missing or misleading evidence. Experiments show that mainstream LLMs suffer substantial accuracy degradation when evidence details mismatch the question or when critical information is omitted, exposing fundamental weaknesses in evidence perception and verification. NeoQA thus establishes a more robust, attributable, and contamination-resistant evaluation standard for RAG systems.

📝 Abstract
Evaluating Retrieval-Augmented Generation (RAG) in large language models (LLMs) is challenging because benchmarks can quickly become stale. Questions initially requiring retrieval may become answerable from pretraining knowledge as newer models incorporate more recent information during pretraining, making it difficult to distinguish evidence-based reasoning from recall. We introduce NeoQA (News Events for Out-of-training Question Answering), a benchmark designed to address this issue. To construct NeoQA, we generated timelines and knowledge bases of fictional news events and entities along with news articles and Q&A pairs to prevent LLMs from leveraging pretraining knowledge, ensuring that no prior evidence exists in their training data. We propose our dataset as a new platform for evaluating evidence-based question answering, as it requires LLMs to generate responses exclusively from retrieved evidence and only when sufficient evidence is available. NeoQA enables controlled evaluation across various evidence scenarios, including cases with missing or misleading details. Our findings indicate that LLMs struggle to distinguish subtle mismatches between questions and evidence, and suffer from short-cut reasoning when key information required to answer a question is missing from the evidence, underscoring key limitations in evidence-based reasoning.
Problem

Research questions and friction points this paper is trying to address.

Evaluating RAG in LLMs with non-stale benchmarks
Preventing pretraining knowledge use in QA tasks
Assessing LLMs' evidence-based reasoning limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generated fictional news events and entities, guaranteed absent from pretraining data
Required answers to be produced exclusively from retrieved evidence
Controlled evaluation across evidence scenarios, including missing and misleading evidence
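The controlled-evidence idea above can be illustrated with a minimal sketch. This is not the paper's actual data format or API; the class and field names below are hypothetical, assumed only for illustration: each question is paired with evidence under a scenario label ("full", "missing", or "misleading"), and the expected model behavior is to answer only when the evidence fully supports one.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvidenceCase:
    """One hypothetical NeoQA-style evaluation instance."""
    question: str
    evidence: list          # retrieved passages shown to the model
    scenario: str           # "full" | "missing" | "misleading"
    gold_answer: Optional[str]  # None means the model should abstain

def expected_behavior(case: EvidenceCase) -> str:
    # Only answer when the evidence is sufficient; under missing or
    # misleading evidence the correct behavior is to abstain.
    return "answer" if case.scenario == "full" else "abstain"

# Fictional example cases (entities are invented, as in the benchmark):
cases = [
    EvidenceCase("When did the Veltrane summit open?",
                 ["The Veltrane summit opened on 3 March."],
                 "full", "3 March"),
    EvidenceCase("When did the Veltrane summit open?",
                 ["Delegates arrived in Veltrane in early March."],
                 "missing", None),
]

for case in cases:
    print(case.scenario, "->", expected_behavior(case))
```

Scoring under this scheme penalizes the short-cut reasoning the paper reports: a model that answers from a plausible-looking but insufficient passage is marked wrong even if its guess happens to match world knowledge.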