HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation

📅 2025-03-03
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses outdated information in the knowledge bases used for retrieval-augmented generation (RAG), which degrades response accuracy and can introduce harmful outputs. To systematically expose the impact of obsolescence on both RAG accuracy and safety, we introduce HoH, the first dynamic temporal benchmark for RAG. Methodologically, we propose the first analytical framework focused on the *outdated interference mechanism* and design a temporal QA data generation method that combines token-level diff analysis with large language models, integrating temporal knowledge modeling with diagnostic RAG evaluation. Empirical results show that state-of-the-art RAG systems are highly vulnerable to outdated information in both the retrieval and generation stages: such obsolescence significantly reduces answer accuracy and triggers factual hallucinations as well as unsafe content. This work provides a foundational benchmark and methodology for advancing the robustness and trustworthiness of RAG systems.

📝 Abstract
While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures temporal knowledge evolution in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG.
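
The token-level diff step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes two snapshots of the same sentence are already aligned, uses whitespace tokenization and Python's standard `difflib`, and omits the LLM step that turns a detected change into a question. The function name `token_diff` and the example sentences are hypothetical.

```python
import difflib


def token_diff(old: str, new: str) -> list[tuple[str, str]]:
    """Return (old_span, new_span) pairs for token spans replaced between two versions."""
    # Assumption: whitespace tokenization is good enough for this illustration.
    old_tokens, new_tokens = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks spans where one piece of text was swapped for another,
        # which is the pattern a changed fact typically produces.
        if tag == "replace":
            changes.append((" ".join(old_tokens[i1:i2]), " ".join(new_tokens[j1:j2])))
    return changes


# Hypothetical snapshots of the same sentence at two points in time.
old_snapshot = "ExampleCorp is led by Alice Smith"
new_snapshot = "ExampleCorp is led by Bob Jones"

for outdated, current in token_diff(old_snapshot, new_snapshot):
    print(f"outdated: {outdated!r} -> current: {current!r}")
```

Each detected pair gives an outdated value and its current replacement, which is the raw material a temporal QA item needs: one question with both a current answer and a superseded one.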
Problem

Research questions and friction points this paper is trying to address.

Evaluates the impact of outdated information on Retrieval-Augmented Generation (RAG).
Creates a benchmark that captures temporal knowledge evolution for assessing RAG.
Shows that outdated information reduces accuracy and risks harmful outputs.
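
One way to make the two failure modes above measurable is to label each response as giving the current answer, the outdated answer, or neither. The sketch below is an assumption for illustration, not the metric defined in the paper: it uses case-insensitive substring matching, and the names `classify` and `summarize` and the sample data are hypothetical.

```python
from collections import Counter


def classify(response: str, current: str, outdated: str) -> str:
    """Label a response by which answer it contains (simple substring matching)."""
    text = response.lower()
    if current.lower() in text:
        return "current"
    if outdated.lower() in text:
        return "outdated"
    return "other"


def summarize(examples: list[tuple[str, str, str]]) -> dict[str, float]:
    """Return the fraction of responses in each category."""
    counts = Counter(classify(resp, cur, old) for resp, cur, old in examples)
    return {label: counts[label] / len(examples) for label in ("current", "outdated", "other")}


# Hypothetical (response, current answer, outdated answer) triples.
examples = [
    ("The company is led by Bob Jones.", "Bob Jones", "Alice Smith"),
    ("Alice Smith is the chief executive.", "Bob Jones", "Alice Smith"),
    ("I am not sure.", "Bob Jones", "Alice Smith"),
    ("Bob Jones took over from Alice Smith.", "Bob Jones", "Alice Smith"),
]

print(summarize(examples))
```

The "current" fraction plays the role of accuracy, while the "outdated" fraction counts responses that were misled by superseded information. Substring matching is crude, so a real evaluation would likely need a stricter answer matcher.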
Innovation

Methods, ideas, or system contributions that make the work stand out.

HoH benchmark evaluates the impact of outdated information on RAG.
Token-level diff algorithms, combined with LLM pipelines, create a temporal QA dataset.
Shows that outdated information reduces RAG accuracy and misleads models.
Authors

Jie Ouyang
State Key Lab of Cognitive Intelligence, University of Science and Technology of China

Tingyue Pan
University of Science and Technology of China
Time Series, Multi Modal

Mingyue Cheng
State Key Lab of Cognitive Intelligence, University of Science and Technology of China

Ruiran Yan
University of Science and Technology of China
RS, IR, LLM

Yucong Luo
State Key Lab of Cognitive Intelligence, University of Science and Technology of China

Jiaying Lin
Peking University
Computer Vision, Multimodal

Qi Liu
State Key Lab of Cognitive Intelligence, University of Science and Technology of China