From Top-1 to Top-K: A Reproducibility Study and Benchmarking of Counterfactual Explanations for Recommender Systems

📅 2026-04-21

🤖 AI Summary

This study addresses the lack of a unified evaluation protocol for counterfactual explanation methods in recommender systems, which has led to reproducibility problems and biased comparisons. To bridge this gap, the authors establish a comprehensive benchmark that covers both explicit and implicit explanation formats, item-level and list-level evaluations, and multiple perturbation scopes. They systematically reimplement and reevaluate eleven prominent methods (including LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE, and GNN-based explainers) under Top-K recommendation settings, assessing them on effectiveness, sparsity, and computational complexity. The analysis shows that the trade-off between effectiveness and sparsity depends strongly on the method and the experimental configuration, that item-level and list-level evaluations yield largely consistent results, and that several GNN explainers hit scalability bottlenecks on large-scale graphs.

📝 Abstract

Counterfactual explanations (CEs) provide an intuitive way to understand recommender systems by identifying minimal modifications to user-item interactions that alter recommendation outcomes. Existing CE methods for recommender systems, however, have been evaluated under heterogeneous protocols, using different datasets, recommenders, metrics, and even explanation formats, which hampers reproducibility and fair comparison. Our paper systematically reproduces, re-implements, and re-evaluates eleven state-of-the-art CE methods for recommender systems, covering both native explainers (e.g., LIME-RS, SHAP, PRINCE, ACCENT, LXR, GREASE) and specific graph-based explainers originally proposed for GNNs. We propose a unified benchmarking framework that assesses explainers along three dimensions: explanation format (implicit vs. explicit), evaluation level (item-level vs. list-level), and perturbation scope (user interaction vectors vs. user-item interaction graphs). Our evaluation protocol includes effectiveness, sparsity, and computational complexity metrics, and extends existing item-level assessments to top-K list-level explanations. Through extensive experiments on three real-world datasets and six representative recommender models, we analyze how well previously reported strengths of CE methods generalize across diverse setups. We observe that the trade-off between effectiveness and sparsity depends strongly on the specific method and evaluation setting, particularly under the explicit format; in addition, explainer performance remains largely consistent across item-level and list-level evaluations, and several graph-based explainers exhibit notable scalability limitations on large recommender graphs. Our results refine and challenge earlier conclusions about the robustness and practicality of CE generation methods in recommender systems. Repository: https://github.com/L2R-UET/CFExpRec.
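
To make the difference between item-level and list-level evaluation concrete, the following minimal Python sketch removes a candidate explanation (a subset of a user's interactions) from the history, re-scores, and reports three toy quantities. It is not taken from the paper or its repository: the similarity-based recommender, the names `top_k`, `item_similarity_scorer`, and `evaluate_explanation`, and the metric definitions (an item-level check that the target leaves the top-K, a list-level check that the top-K list changes, and sparsity as the fraction of the history removed) are all illustrative assumptions.

```python
from typing import Callable, Dict, List, Set


def top_k(scores: List[float], k: int, exclude: Set[int]) -> List[int]:
    """Return the k highest-scoring item indices that are not in `exclude`."""
    candidates = [i for i in range(len(scores)) if i not in exclude]
    # Sort by descending score; break ties by item index for determinism.
    return sorted(candidates, key=lambda i: (-scores[i], i))[:k]


def item_similarity_scorer(sim: List[List[float]]) -> Callable[[Set[int]], List[float]]:
    """Toy recommender (assumption): score = sum of similarities to the history."""
    def score(history: Set[int]) -> List[float]:
        return [sum(row[j] for j in history) for row in sim]
    return score


def evaluate_explanation(
    score_fn: Callable[[Set[int]], List[float]],
    history: Set[int],
    explanation: Set[int],
    target: int,
    k: int,
) -> Dict[str, object]:
    """Remove `explanation` from `history`, re-score, and report toy metrics."""
    remaining = history - explanation
    # Keep the candidate set fixed by always excluding the original history.
    before = top_k(score_fn(history), k, exclude=history)
    after = top_k(score_fn(remaining), k, exclude=history)
    return {
        "top_k_before": before,
        "top_k_after": after,
        # Item-level (assumed definition): the target drops out of the top-K.
        "item_level_valid": target in before and target not in after,
        # List-level (assumed definition): the top-K list itself changes.
        "list_level_changed": before != after,
        # Sparsity (assumed definition): share of the history removed.
        "sparsity": len(explanation) / len(history),
    }


# Hypothetical 5-item similarity matrix; the user has interacted with items 0 and 1.
sim = [
    [1.0, 0.3, 0.9, 0.2, 0.1],
    [0.3, 1.0, 0.1, 0.5, 0.3],
    [0.9, 0.1, 1.0, 0.0, 0.0],
    [0.2, 0.5, 0.0, 1.0, 0.0],
    [0.1, 0.3, 0.0, 0.0, 1.0],
]
score_fn = item_similarity_scorer(sim)
result = evaluate_explanation(score_fn, history={0, 1}, explanation={0}, target=2, k=2)
print(result)
```

In this toy setup, removing item 0 pushes item 2 out of the top-2 list, whereas removing item 1 leaves the list unchanged, so only the first subset would count as a counterfactual explanation under the assumed definitions.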

Problem

Research questions and friction points this paper is trying to address.

counterfactual explanations

recommender systems

reproducibility

benchmarking

evaluation protocols

Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Explanations

Recommender Systems

Benchmarking Framework