🤖 AI Summary
Existing benchmarks summarize system performance with a small set of aggregate metrics, hindering fine-grained analysis and interpretable evaluation. To address this, we propose explainable benchmarking: automatically generating human-understandable explanations for the performance of systems in a benchmark. We instantiate this paradigm for knowledge graph question answering (KGQA) with PruneCEL, a concept learning approach that uses task-driven pruning to scale to large knowledge graphs. PruneCEL produces explanations of system behavior end-to-end, without manual annotation. Empirically, it outperforms state-of-the-art concept learners by up to 0.55 points F1 on this task. In a user study, the majority of participants accurately predicted system behavior in 80% of cases using our explanations, demonstrating substantial gains in evaluation transparency and practical utility.
📝 Abstract
Evaluating competing systems in a comparable way, i.e., benchmarking them, is an undeniable pillar of the scientific method. However, system performance is often summarized via a small number of metrics, and the analysis of the evaluation details and the derivation of insights for further development or use remain a tedious manual task with often biased results. This paper therefore argues for a new type of benchmarking, dubbed explainable benchmarking. The aim of explainable benchmarking approaches is to automatically generate explanations for the performance of systems in a benchmark. We provide a first instantiation of this paradigm for knowledge-graph-based question answering systems. We compute explanations by using PruneCEL, a novel concept learning approach developed for large knowledge graphs. Our evaluation shows that PruneCEL outperforms state-of-the-art concept learners on the task of explainable benchmarking by up to 0.55 points F1 measure. A task-driven user study with 41 participants shows that in 80% of the cases, the majority of participants can accurately predict the behavior of a system based on our explanations. Our code and data are available at https://github.com/dice-group/PruneCEL/tree/K-cap2025.
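To make the core idea concrete, here is a toy sketch (not the authors' PruneCEL implementation, and with entirely hypothetical data) of concept-based explainable benchmarking: given benchmark questions tagged with knowledge-graph concepts, search for the concept whose presence best separates questions a system answered correctly (positives) from those it failed (negatives), scored by F1. Real concept learners search over complex class expressions with pruning; this sketch restricts the search to single concepts for brevity.

```python
# Toy illustration of explainable benchmarking via concept learning.
# NOT PruneCEL: a single-concept search over hypothetical tagged questions.

def f1(tp: int, fp: int, fn: int) -> float:
    """F1 measure from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_concept(questions):
    """questions: list of (concept_tags: set[str], answered_correctly: bool).
    Returns the concept whose presence best predicts correct answers."""
    concepts = set().union(*(tags for tags, _ in questions))
    best, best_score = None, -1.0
    for c in concepts:
        tp = sum(1 for tags, ok in questions if c in tags and ok)
        fp = sum(1 for tags, ok in questions if c in tags and not ok)
        fn = sum(1 for tags, ok in questions if c not in tags and ok)
        score = f1(tp, fp, fn)
        if score > best_score:
            best, best_score = c, score
    return best, best_score

# Hypothetical benchmark slice: the system succeeds on Person-centric questions.
data = [
    ({"Person", "BirthPlace"}, True),
    ({"Person", "Award"}, True),
    ({"Film", "Director"}, False),
    ({"City", "Population"}, False),
]
concept, score = best_concept(data)
print(concept, score)  # → Person 1.0
```

An explanation of the form "the system answers Person-centric questions correctly" is exactly the kind of human-understandable output the paper's user study evaluates; scaling such a search to complex class expressions over large knowledge graphs is where pruning becomes essential.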