🤖 AI Summary
This work addresses the fragmented, inconsistent, and poorly reproducible evaluation of SPARQL query generation in knowledge graph question answering, a field that still lacks a unified evaluation standard. The authors propose t2s-metrics, an open-source, modular evaluation library that integrates over twenty metrics spanning lexical, syntactic, semantic, structural, execution-based, and ranking-based dimensions, and that supports assessment at both the query level and the answer level. Inspired by ir-measures, the library uses an abstraction layer to decouple metric specification from implementation. Its metrics include token-level F1, CodeBLEU, graph structure matching, answer-set Jaccard similarity, MRR, NDCG, and LLM-as-a-Judge. Compared with evaluation based on answer correctness alone, this gives finer-grained diagnostic insight into system behavior and supports more standardized, reproducible evaluation of SPARQL generation.
📝 Abstract
The evaluation of Question Answering (QA) systems over Knowledge Graphs has historically suffered from fragmentation, inconsistency, and limited reproducibility. While significant progress has been made in semantic parsing and SPARQL query generation, evaluation methodologies remain diverse, ad hoc, and often incomparable across studies. Existing benchmarks typically focus on a small subset of metrics, such as query exact match or answer-level F1, and neglect syntactic validity, semantic faithfulness, execution correctness, result ranking quality, and computational efficiency. In this paper, we present t2s-metrics, an open-source, extensible, and unified evaluation library designed specifically for SPARQL query comparison and execution-based assessment. t2s-metrics provides more than 20 evaluation metrics, collected from the literature and from practical evaluation needs, spanning lexical, syntactic, semantic, structural, execution-based, and ranking-based dimensions. Query-based metrics include token-level Precision, Recall, and F1; BLEU, ROUGE, METEOR, and CodeBLEU variants; variable-normalized metrics (SP-BLEU, SP-F1); and graph- and URI-based exact match metrics. Answer-set-based metrics include F1-QALD and Jaccard similarity. Ranking metrics include MRR, NDCG, P@k, and Hit@k. LLM-as-a-Judge metrics are also provided. Taking inspiration from the ir-measures library for Information Retrieval, t2s-metrics provides a modular abstraction layer that decouples metric specification from implementation, enabling consistent, transparent, and reproducible evaluation of SPARQL-based QA systems. We argue that t2s-metrics constitutes a necessary step toward systematic, standardized evaluation in question answering over knowledge graphs and facilitates deeper diagnostic insights into system behavior beyond answer correctness.
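To make the metric families named above concrete, the following is a minimal Python sketch of three of them: token-level Precision/Recall/F1 over query text (with an optional variable renaming step in the spirit of the variable-normalized metrics), answer-set Jaccard similarity, and MRR. It does not use or reproduce the t2s-metrics API. All function names, the regex tokenizer, and the variable renaming scheme are assumptions made for illustration, and the library's actual definitions (for example of SP-F1) may differ.

```python
import re
from collections import Counter


def tokenize(query: str) -> list[str]:
    """Naive SPARQL tokenizer (an assumption, not a full SPARQL lexer)."""
    # Matches <IRIs>, ?variables, keywords/prefixed names, and basic punctuation.
    return re.findall(r"<[^>]*>|\?\w+|[\w:]+|[{}().;,*]", query)


def normalize_variables(query: str) -> str:
    """Rename variables to ?v0, ?v1, ... in order of first appearance."""
    mapping: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        return mapping.setdefault(match.group(0), f"?v{len(mapping)}")

    return re.sub(r"\?\w+", rename, query)


def token_prf(pred: str, gold: str) -> tuple[float, float, float]:
    """Token-level precision, recall, and F1 using multiset overlap."""
    p, g = Counter(tokenize(pred)), Counter(tokenize(gold))
    overlap = sum((p & g).values())
    precision = overlap / sum(p.values()) if p else 0.0
    recall = overlap / sum(g.values()) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def jaccard(pred_answers, gold_answers) -> float:
    """Jaccard similarity of two answer sets; two empty sets count as a match."""
    a, b = set(pred_answers), set(gold_answers)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def reciprocal_rank(ranked, gold) -> float:
    """1 / rank of the first correct answer, or 0.0 if none is returned."""
    gold = set(gold)
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs) -> float:
    """Average reciprocal rank over (ranked_answers, gold_answers) pairs."""
    runs = list(runs)
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs) if runs else 0.0


gold_q = "SELECT ?x WHERE { ?x wdt:P31 wd:Q5 }"
pred_q = "SELECT ?y WHERE { ?y wdt:P31 wd:Q5 }"

raw_scores = token_prf(pred_q, gold_q)
normalized_scores = token_prf(normalize_variables(pred_q), normalize_variables(gold_q))
```

In this example the two queries differ only in the variable name, so the raw token-level scores are penalized while the variable-normalized scores are not. This is the kind of diagnostic difference that a purely answer-level or exact-match evaluation would not expose.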