🤖 AI Summary
Evaluation of explainable recommender systems relies heavily on recommendation performance and subjective user feedback, and lacks objective, content-oriented metrics for explanation Veracity, the information quality of the explanations themselves. This paper applies signal detection theory to explanation evaluation and decomposes Veracity into two dimensions: Fidelity (whether the explanation contains accurate information about the recommended item) and Attunement (whether the explanation reflects the target user's preferences). Decision outcomes are determined for each dimension and then combined into a sensitivity value, which serves as the Veracity score. The metric is tested on four cases with varying levels of information quality to check whether it captures the differences in quality. The work targets the gap in objective, content-centric assessment of explanations in explainable recommender systems.
📝 Abstract
There is growing interest in explainable recommender systems that provide recommendations along with explanations for the reasoning behind them. When evaluating recommender systems, most studies focus on overall recommendation performance. Only a few assess the quality of the explanations. Explanation quality is often evaluated through user studies that gather users' subjective opinions on representative explanatory factors shaping end-users' perspective on the results, rather than on the explanation content itself. We aim to fill this gap by developing an objective metric to evaluate Veracity: the information quality of explanations. Specifically, we decompose Veracity into two dimensions: Fidelity and Attunement. Fidelity refers to whether the explanation includes accurate information about the recommended item. Attunement evaluates whether the explanation reflects the target user's preferences. By applying signal detection theory, we first determine decision outcomes for each dimension and then combine them to calculate a sensitivity, which serves as the final Veracity value. To assess the effectiveness of the proposed metric, we set up four cases with varying levels of information quality to validate whether our metric can accurately capture differences in quality. The results provided meaningful insights into the effectiveness of our proposed metric.
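For readers unfamiliar with signal detection theory, the sensitivity mentioned in the abstract is commonly computed as d′ = z(hit rate) − z(false-alarm rate), where z is the inverse of the standard normal CDF. The sketch below shows only this standard calculation. The abstract does not say how Fidelity and Attunement outcomes are mapped to hits, misses, false alarms, and correct rejections, so the function name `d_prime`, its four count arguments, and the log-linear correction (adding 0.5 to each count) are assumptions made for illustration, not the authors' method.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Standard SDT sensitivity d' from the four decision-outcome counts.

    Assumption: the four counts are already available. How the paper
    derives them from Fidelity and Attunement is not stated in the abstract.
    """
    # Log-linear correction (an assumed choice): add 0.5 to each count so
    # that neither rate is exactly 0 or 1, where the z-transform is undefined.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)

    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)
```

Under this definition, equal hit and false-alarm rates give a sensitivity of zero, and the value grows as hits come to dominate false alarms, which is the behaviour one would expect a quality metric built on it to show across cases of increasing information quality.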