🤖 AI Summary
Popular KGQA benchmarks such as WebQSP and CWQ suffer from severe quality issues: high annotation error rates (an average factual correctness of only 57% across 16 audited datasets), ambiguous or unanswerable questions, and outdated or inconsistent knowledge, all of which undermine evaluation reliability. This work presents a systematic diagnosis of these flaws and introduces KGQAGen, a closed-loop framework for generating verifiable, challenging multi-hop QA instances. KGQAGen combines structured knowledge grounding, LLM-in-the-loop generation, and symbolic verification. Grounded in Wikidata, the resulting benchmark, KGQAGen-10k, comprises 10,000 rigorously validated instances. Experiments show that state-of-the-art KG-RAG models degrade substantially on KGQAGen-10k, exposing their genuine reasoning bottlenecks and establishing it as a more reliable, challenging, and diagnostic benchmark for KGQA evaluation.
📝 Abstract
Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand-scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.
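The abstract describes a generate-and-verify loop: ground a candidate in a Wikidata subgraph, let an LLM propose a question with a symbolic proof, and accept the instance only if the proof checks out against the knowledge graph. The minimal Python sketch below illustrates one plausible shape of such a loop. The Wikidata SPARQL endpoint URL is real; `llm_propose`, the feedback format, and the acceptance criterion are hypothetical placeholders, not the paper's actual implementation.

```python
import requests

# Public Wikidata SPARQL endpoint (real service).
WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def run_sparql(query: str) -> set[str]:
    """Execute a SPARQL query against Wikidata and return the set of bound values."""
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "kgqagen-sketch/0.1 (illustrative example)"},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    return {binding["value"] for row in rows for binding in row.values()}

def llm_propose(subgraph: str, feedback: str):
    """Hypothetical LLM call (not part of the paper's released code).

    Expected to return (question, claimed_answers, sparql_proof_query)
    for the given serialized subgraph, revising based on feedback.
    """
    raise NotImplementedError("plug in an LLM client here")

def generate_instance(seed_subgraph: str, max_rounds: int = 3):
    """Closed-loop generation: propose a QA pair plus a SPARQL proof,
    verify it symbolically against the live KG, and iterate on failure."""
    feedback = ""
    for _ in range(max_rounds):
        question, claimed, proof = llm_propose(seed_subgraph, feedback)
        verified = run_sparql(proof)  # symbolic verification step
        if verified and verified == set(claimed):
            # Accept only instances whose claimed answers match the KG.
            return {"question": question, "answers": sorted(verified), "proof": proof}
        # Feed the mismatch back to the LLM and try another round.
        feedback = f"SPARQL returned {sorted(verified)!r} but you claimed {claimed!r}; revise."
    return None  # discard instances that never pass verification
```

Under this reading, the symbolic check is what makes each instance verifiable: an accepted question ships with a SPARQL proof that any evaluator can re-execute against Wikidata.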