🤖 AI Summary
Existing benchmarks summarize system performance with a small set of aggregate metrics, hindering fine-grained analysis and interpretable evaluation. To address this, we propose explainable benchmarking: automatically generating human-understandable explanations for the performance of systems in a benchmark. We instantiate this paradigm for knowledge graph question answering (KGQA) with PruneCEL, a concept learning approach that uses task-driven pruning to scale to large knowledge graphs. PruneCEL produces explanations of system behavior end-to-end, without manual annotation. Empirically, it outperforms state-of-the-art concept learners by up to 0.55 points F1 on this task. In a user study, the majority of participants accurately predicted system behavior in 80% of cases using our explanations, demonstrating substantial gains in evaluation transparency and practical utility.
📝 Abstract
Evaluating competing systems in a comparable way, i.e., benchmarking them, is an undeniable pillar of the scientific method. However, system performance is often summarized via a small number of metrics, and the analysis of the evaluation details and the derivation of insights for further development or use remain a tedious manual task with often biased results. This paper therefore argues for a new type of benchmarking, dubbed explainable benchmarking. The aim of explainable benchmarking approaches is to automatically generate explanations for the performance of systems in a benchmark. We provide a first instantiation of this paradigm for knowledge-graph-based question answering systems. We compute explanations by using PruneCEL, a novel concept learning approach developed for large knowledge graphs. Our evaluation shows that PruneCEL outperforms state-of-the-art concept learners on the task of explainable benchmarking by up to 0.55 points F1 measure. A task-driven user study with 41 participants shows that in 80% of the cases, the majority of participants can accurately predict the behavior of a system based on our explanations. Our code and data are available at https://github.com/dice-group/PruneCEL/tree/K-cap2025.
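To make the core idea concrete, here is a toy sketch (not the authors' PruneCEL implementation, and with entirely hypothetical data) of concept-based explainable benchmarking: given benchmark questions tagged with knowledge-graph concepts, search for the concept whose presence best separates questions a system answered correctly (positives) from those it failed (negatives), scored by F1. Real concept learners search over complex class expressions with pruning; this sketch restricts the search to single concepts for brevity.

```python
# Toy illustration of explainable benchmarking via concept learning.
# NOT PruneCEL: a single-concept search over hypothetical tagged questions.

def f1(tp: int, fp: int, fn: int) -> float:
    """F1 measure from true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_concept(questions):
    """questions: list of (concept_tags: set[str], answered_correctly: bool).
    Returns the concept whose presence best predicts correct answers."""
    concepts = set().union(*(tags for tags, _ in questions))
    best, best_score = None, -1.0
    for c in concepts:
        tp = sum(1 for tags, ok in questions if c in tags and ok)
        fp = sum(1 for tags, ok in questions if c in tags and not ok)
        fn = sum(1 for tags, ok in questions if c not in tags and ok)
        score = f1(tp, fp, fn)
        if score > best_score:
            best, best_score = c, score
    return best, best_score

# Hypothetical benchmark slice: the system succeeds on Person-centric questions.
data = [
    ({"Person", "BirthPlace"}, True),
    ({"Person", "Award"}, True),
    ({"Film", "Director"}, False),
    ({"City", "Population"}, False),
]
concept, score = best_concept(data)
print(concept, score)  # → Person 1.0
```

An explanation of the form "the system answers Person-centric questions correctly" is exactly the kind of human-understandable output the paper's user study evaluates; scaling such a search to complex class expressions over large knowledge graphs is where pruning becomes essential.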