🤖 AI Summary
Popular KGQA benchmarks such as WebQSP and CWQ suffer from severe quality issues: high annotation error rates (an average factual correctness of only 57% across 16 audited datasets), ambiguous or unanswerable questions, and outdated or inconsistent knowledge, all of which undermine evaluation reliability. This work presents a systematic diagnosis of these flaws and introduces KGQAGen, a closed-loop framework for generating verifiable, challenging multi-hop QA instances. KGQAGen combines structured knowledge grounding, LLM-in-the-loop generation, and symbolic verification. Grounded in Wikidata, the resulting benchmark, KGQAGen-10k, comprises 10,000 rigorously validated instances. Experiments show that state-of-the-art KG-RAG models degrade substantially on KGQAGen-10k, exposing their genuine reasoning bottlenecks and establishing it as a more reliable, challenging, and diagnostic benchmark for KGQA evaluation.
📝 Abstract
Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand-scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
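The abstract describes a generate-and-verify loop: ground a candidate in a Wikidata subgraph, let an LLM propose a question with a symbolic proof, and accept the instance only if the proof checks out against the knowledge graph. The minimal Python sketch below illustrates one plausible shape of such a loop. The Wikidata SPARQL endpoint URL is real; `llm_propose`, the feedback format, and the acceptance criterion are hypothetical placeholders, not the paper's actual implementation.

```python
import requests

# Public Wikidata SPARQL endpoint (real service).
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def run_sparql(query: str) -> set[str]:
    """Execute a SPARQL query against Wikidata and return the set of bound values."""
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "kgqagen-sketch/0.1 (illustrative example)"},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return {binding["value"] for row in rows for binding in row.values()}

def llm_propose(subgraph: str, feedback: str):
    """Hypothetical LLM call (not part of the paper's released code).

    Expected to return (question, claimed_answers, sparql_proof_query)
    for the given serialized subgraph, revising based on feedback.
    """
    raise NotImplementedError("plug in an LLM client here")

def generate_instance(seed_subgraph: str, max_rounds: int = 3):
    """Closed-loop generation: propose a QA pair plus a SPARQL proof,
    verify it symbolically against the live KG, and iterate on failure."""
    feedback = ""
    for _ in range(max_rounds):
        question, claimed, proof = llm_propose(seed_subgraph, feedback)
        verified = run_sparql(proof)  # symbolic verification step
        if verified and verified == set(claimed):
            # Accept only instances whose claimed answers match the KG.
            return {"question": question, "answers": sorted(verified), "proof": proof}
        # Feed the mismatch back to the LLM and try another round.
        feedback = f"SPARQL returned {sorted(verified)!r} but you claimed {claimed!r}; revise."
    return None  # discard instances that never pass verification
```

Under this reading, the symbolic check is what makes each instance verifiable: an accepted question ships with a SPARQL proof that any evaluator can re-execute against Wikidata.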