NeoQA: Evidence-based Question Answering with Generated News Events

📅 2025-05-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of distinguishing "evidence-based reasoning" from "knowledge recall" in large language models (LLMs) for evidence-based question answering. To this end, the authors introduce NeoQA, a benchmark offering a fully controllable, RAG-oriented evaluation paradigm grounded in synthetically generated fictional news events. NeoQA combines event-graph-driven news generation, structured knowledge base construction, and adversarial question-answer pair design to ensure zero training-data contamination and full traceability of evidence sources, while supporting fine-grained reasoning scenarios such as missing or misleading evidence. Experiments show that mainstream LLMs suffer substantial accuracy degradation when evidence details mismatch the question or when critical information is omitted, exposing fundamental weaknesses in evidence perception and verification. NeoQA thus establishes a more robust, attributable, and contamination-resistant evaluation standard for RAG systems.

📝 Abstract
Evaluating Retrieval-Augmented Generation (RAG) in large language models (LLMs) is challenging because benchmarks can quickly become stale. Questions initially requiring retrieval may become answerable from pretraining knowledge as newer models incorporate more recent information during pretraining, making it difficult to distinguish evidence-based reasoning from recall. We introduce NeoQA (News Events for Out-of-training Question Answering), a benchmark designed to address this issue. To construct NeoQA, we generated timelines and knowledge bases of fictional news events and entities along with news articles and Q&A pairs to prevent LLMs from leveraging pretraining knowledge, ensuring that no prior evidence exists in their training data. We propose our dataset as a new platform for evaluating evidence-based question answering, as it requires LLMs to generate responses exclusively from retrieved evidence and only when sufficient evidence is available. NeoQA enables controlled evaluation across various evidence scenarios, including cases with missing or misleading details. Our findings indicate that LLMs struggle to distinguish subtle mismatches between questions and evidence, and suffer from short-cut reasoning when key information required to answer a question is missing from the evidence, underscoring key limitations in evidence-based reasoning.
Problem

Research questions and friction points this paper is trying to address.

Evaluating RAG in LLMs with non-stale benchmarks
Preventing pretraining knowledge use in QA tasks
Assessing LLMs' evidence-based reasoning limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generated fictional news events and entities, guaranteed absent from pretraining data
Required answers to be produced exclusively from retrieved evidence
Controlled evaluation across evidence scenarios, including missing and misleading evidence
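The controlled-evidence idea above can be illustrated with a minimal sketch. This is not the paper's actual data format or API; the class and field names below are hypothetical, assumed only for illustration: each question is paired with evidence under a scenario label ("full", "missing", or "misleading"), and the expected model behavior is to answer only when the evidence fully supports one.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvidenceCase:
    """One hypothetical NeoQA-style evaluation instance."""
    question: str
    evidence: list          # retrieved passages shown to the model
    scenario: str           # "full" | "missing" | "misleading"
    gold_answer: Optional[str]  # None means the model should abstain

def expected_behavior(case: EvidenceCase) -> str:
    # Only answer when the evidence is sufficient; under missing or
    # misleading evidence the correct behavior is to abstain.
    return "answer" if case.scenario == "full" else "abstain"

# Fictional example cases (entities are invented, as in the benchmark):
cases = [
    EvidenceCase("When did the Veltrane summit open?",
                 ["The Veltrane summit opened on 3 March."],
                 "full", "3 March"),
    EvidenceCase("When did the Veltrane summit open?",
                 ["Delegates arrived in Veltrane in early March."],
                 "missing", None),
]

for case in cases:
    print(case.scenario, "->", expected_behavior(case))
```

Scoring under this scheme penalizes the short-cut reasoning the paper reports: a model that answers from a plausible-looking but insufficient passage is marked wrong even if its guess happens to match world knowledge.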