🤖 AI Summary
This paper addresses the problem of inflated performance estimates for large language models (LLMs) in natural-language-to-SPARQL generation, which can arise from training-data memorization or knowledge graph (KG) leakage during evaluation. To mitigate this, we propose a controlled evaluation framework built around anonymized knowledge injection, a technique that systematically separates a model's intrinsic knowledge from external KG-derived information in zero-shot settings while remaining portable across arbitrary KGs. Our contributions are threefold: (1) revealing the actual generalization capability of LLMs in KG-based question answering (KGQA) for SPARQL generation; (2) establishing a reproducible ablation paradigm that rigorously distinguishes performance gains attributable to methodological advances from spurious improvements caused by training-data leakage; and (3) improving the robustness and reliability of evaluation outcomes, thereby providing both a conceptual foundation and practical tooling for the trustworthy deployment of KGQA systems.
📝 Abstract
Nowadays, the importance of software with natural-language user interfaces can hardly be overestimated. In Question Answering (QA) systems over Knowledge Graphs (KGQA), the central task is to generate a SPARQL query for a given natural-language question (often called Query Building) from the information retrieved from that question. With the rise of Large Language Models (LLMs), they are considered a well-suited means of improving question-answering quality, as there is still considerable room for improvement with respect to both quality and trustworthiness. However, LLMs are trained on web data, and researchers have no control over whether a benchmark or the knowledge graph itself was already included in the training data. In this paper, we introduce a novel method that evaluates the quality of LLMs at generating a SPARQL query from a natural-language question under three conditions: (1) zero-shot SPARQL generation, (2) with knowledge injection, and (3) with "anonymized" knowledge injection. This enables us, for the first time, to estimate the influence of the training data on the QA quality achieved with LLMs. Ultimately, this helps to identify how portable a method is, or whether good results might mostly be achieved because a benchmark was already included in the training data (cf. LLM memorization). The developed method is portable, robust, and supports any knowledge graph; it can therefore easily be applied to any KGQA system or LLM, such that consistent insights into the actual capabilities of LLMs can be generated.
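To illustrate the idea behind condition (3), the sketch below shows one plausible way to anonymize injected KG knowledge before prompting an LLM. This is a hypothetical helper, not the paper's actual implementation: every URI in the injected triples is replaced by an opaque placeholder (so the model cannot rely on memorized identifiers such as DBpedia resource names), and the mapping is inverted on the generated SPARQL query afterwards.

```python
# Hypothetical sketch of "anonymized" knowledge injection (assumed design,
# not the paper's code): mask KG URIs with opaque placeholders before they
# are shown to the LLM, then restore them in the generated SPARQL query.

def anonymize_triples(triples):
    """Replace every URI in (subject, predicate, object) triples with a
    stable placeholder; return the anonymized triples and the mapping."""
    mapping = {}  # placeholder -> original URI

    def placeholder(uri, kind):
        # Reuse an existing placeholder if this URI was seen before.
        for ph, orig in mapping.items():
            if orig == uri:
                return ph
        ph = f"urn:anon:{kind}{len(mapping)}"
        mapping[ph] = uri
        return ph

    anonymized = [
        (
            placeholder(s, "e"),
            placeholder(p, "p"),
            # Literals (non-URI objects) are kept as-is in this sketch.
            placeholder(o, "e") if o.startswith("http") else o,
        )
        for s, p, o in triples
    ]
    return anonymized, mapping


def deanonymize_query(sparql, mapping):
    """Map placeholders in the generated SPARQL query back to real URIs.
    Note: naive string replacement; sufficient for this small sketch."""
    for ph, uri in mapping.items():
        sparql = sparql.replace(ph, uri)
    return sparql


# Usage: anonymize a triple, pretend the LLM produced a query over the
# placeholders, then restore the original URIs for execution on the KG.
triples = [
    ("http://dbpedia.org/resource/Berlin",
     "http://dbpedia.org/ontology/country",
     "http://dbpedia.org/resource/Germany"),
]
anon, mapping = anonymize_triples(triples)
generated = "SELECT ?o WHERE { <urn:anon:e0> <urn:anon:p1> ?o }"
print(deanonymize_query(generated, mapping))
```

Comparing accuracy across the three conditions then isolates what the model knows intrinsically (zero-shot), what it gains from real identifiers (knowledge injection), and what it gains from the injected structure alone (anonymized injection).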