🤖 AI Summary
This work addresses the fragility of existing counterfactual explanations under frequent model updates. To this end, it proposes the first provably robust counterfactual generation framework, which introduces a formal ⟨δ, ε⟩-set definition to guarantee probabilistic safety (δ-safety) and robustness (ε-robustness) against model changes. By integrating Bayesian modeling with uncertainty-aware optimization, the framework ensures that generated explanations remain valid and stable even as the underlying model evolves. Experimental results demonstrate that the resulting counterfactuals are not only more discriminative and plausible but also maintain theoretically verifiable stability after model updates, offering a significant advance toward reliable and trustworthy post-hoc interpretability in dynamic learning environments.
📝 Abstract
Counterfactual explanations (CEs) offer interpretable insights into machine learning predictions by answering ``what if?'' questions. However, in real-world settings where models are frequently updated, existing counterfactual explanations can quickly become invalid or unreliable. In this paper, we introduce Probabilistically Safe CEs (PSCE), a method for generating counterfactual explanations that are $\delta$-safe, ensuring high predictive confidence, and $\epsilon$-robust, ensuring low predictive variance. Based on Bayesian principles, PSCE provides formal probabilistic guarantees for CEs under model changes, captured in what we refer to as the $\langle \delta, \epsilon \rangle$-set. Uncertainty-aware constraints are integrated into our optimization framework, and we validate our method empirically across diverse datasets. Compared against state-of-the-art Bayesian CE methods, PSCE produces counterfactual explanations that are not only more plausible and discriminative, but also provably robust under model change.
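To make the $\langle \delta, \epsilon \rangle$-set criterion concrete, here is a minimal sketch of how membership might be checked. It assumes the Bayesian posterior over models is approximated by a finite set of sampled models, each returning the predicted probability of the target class; the function name `in_delta_eps_set` and the specific checks (posterior-mean probability for $\delta$-safety, predictive variance for $\epsilon$-robustness) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def in_delta_eps_set(x_cf, model_samples, delta, eps):
    """Illustrative check that a candidate counterfactual x_cf lies in
    the <delta, eps>-set.

    model_samples: callables mapping a feature vector to the predicted
    probability of the target class, e.g. models drawn from an
    (assumed) Bayesian posterior approximation.
    """
    probs = np.array([m(x_cf) for m in model_samples])
    delta_safe = probs.mean() >= delta  # delta-safety: high predictive confidence
    eps_robust = probs.var() <= eps     # eps-robustness: low predictive variance
    return bool(delta_safe and eps_robust)

# Toy posterior: three sampled "models" that largely agree on x_cf
posterior = [lambda x, b=b: 0.9 + b for b in (-0.02, 0.0, 0.02)]
x_cf = np.array([0.5, 1.2])
print(in_delta_eps_set(x_cf, posterior, delta=0.8, eps=0.01))  # True
```

In an optimization framework like the one the abstract describes, such a check would act as a constraint (or a penalty) while searching for counterfactuals, rather than a post-hoc filter.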