Explainable Benchmarking through the Lense of Concept Learning

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks rely heavily on a small set of aggregate metrics, hindering fine-grained performance analysis and interpretable evaluation. To address this, the paper proposes explainable benchmarking, an interpretability-aware framework that integrates concept learning into system performance attribution. Specifically, it introduces PruneCEL, a concept learner with task-driven pruning designed for large knowledge graphs, which automatically generates human-understandable explanations for the behavior of knowledge graph question answering (KGQA) systems without manual annotation. Empirically, PruneCEL outperforms state-of-the-art concept learners by up to 0.55 points F1 measure. In a task-driven user study with 41 participants, the majority of participants accurately predicted system behavior in 80% of cases using the generated explanations, demonstrating gains in evaluation transparency and practical utility.

📝 Abstract
Evaluating competing systems in a comparable way, i.e., benchmarking them, is an undeniable pillar of the scientific method. However, system performance is often summarized via a small number of metrics. The analysis of the evaluation details and the derivation of insights for further development or use remains a tedious manual task with often biased results. Thus, this paper argues for a new type of benchmarking, which is dubbed explainable benchmarking. The aim of explainable benchmarking approaches is to automatically generate explanations for the performance of systems in a benchmark. We provide a first instantiation of this paradigm for knowledge-graph-based question answering systems. We compute explanations by using a novel concept learning approach developed for large knowledge graphs called PruneCEL. Our evaluation shows that PruneCEL outperforms state-of-the-art concept learners on the task of explainable benchmarking by up to 0.55 points F1 measure. A task-driven user study with 41 participants shows that in 80% of the cases, the majority of participants can accurately predict the behavior of a system based on our explanations. Our code and data are available at https://github.com/dice-group/PruneCEL/tree/K-cap2025
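The core idea described in the abstract, learning a concept that separates the questions a system answers correctly from those it fails, can be illustrated with a toy sketch. This is not the actual PruneCEL implementation: the entity type sets, the example data, and the greedy search below are illustrative assumptions only; PruneCEL searches a much richer space of class expressions over large knowledge graphs.

```python
# Toy sketch of concept learning for explainable benchmarking (NOT the
# actual PruneCEL algorithm): questions a QA system answered correctly
# are positives, failed questions are negatives, and we search for a
# concept (here: a set of required entity types) separating them.
# F1 measures how well the learned concept explains system behavior.

def f1(concept, positives, negatives):
    """F1 of 'entity has all types in concept' as a classifier for positives."""
    tp = sum(1 for types in positives if concept <= types)
    fp = sum(1 for types in negatives if concept <= types)
    fn = len(positives) - tp
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def learn_concept(positives, negatives, vocabulary):
    """Greedy refinement: repeatedly add the type that most improves F1
    (a crude stand-in for PruneCEL's pruned search over class expressions)."""
    concept = frozenset()
    best = f1(concept, positives, negatives)
    improved = True
    while improved:
        improved = False
        for t in sorted(vocabulary):  # sorted for deterministic results
            cand = concept | {t}
            score = f1(cand, positives, negatives)
            if score > best:
                best, concept, improved = score, frozenset(cand), True
    return concept, best

# Hypothetical data: type sets of the KG entities behind each question.
pos = [{"Person", "Scientist"}, {"Person", "Scientist", "Nobel"}]
neg = [{"Person", "Politician"}, {"City"}]
concept, score = learn_concept(pos, neg, {"Person", "Scientist", "Nobel", "City"})
print(sorted(concept), round(score, 2))  # → ['Person', 'Scientist'] 1.0
```

The learned concept then serves as the explanation: in this toy case, the hypothetical system succeeds exactly on questions about scientists, which a human can verify at a glance, mirroring the user study in which participants predicted system behavior from such explanations.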
Problem

Research questions and friction points this paper is trying to address.

Automating performance explanation generation for benchmarking systems
Addressing manual analysis bias in system evaluation metrics
Applying concept learning to explain knowledge graph QA systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces explainable benchmarking for system evaluation
Uses PruneCEL concept learning on knowledge graphs
Generates performance explanations for question answering systems
Quannian Zhang
Faculty of Computer Science, Electrical Engineering and Mathematics, Data Science Group (DICE), Heinz Nixdorf Institute, Paderborn University, Paderborn, North Rhine-Westphalia, Germany
Michael Röder
Faculty of Computer Science, Electrical Engineering and Mathematics, Data Science Group (DICE), Heinz Nixdorf Institute, Paderborn University, Paderborn, North Rhine-Westphalia, Germany
Nikit Srivastava
Faculty of Computer Science, Electrical Engineering and Mathematics, Data Science Group (DICE), Heinz Nixdorf Institute, Paderborn University, Paderborn, North Rhine-Westphalia, Germany
N’Dah Jean Kouagou
Faculty of Computer Science, Electrical Engineering and Mathematics, Data Science Group (DICE), Heinz Nixdorf Institute, Paderborn University, Paderborn, North Rhine-Westphalia, Germany
Axel-Cyrille Ngonga Ngomo
Professor of Data Science at Paderborn University, Heinz Nixdorf Institute
Knowledge Graphs · Knowledge Engineering · Semantic Web · Machine Learning