🤖 AI Summary
This work addresses the fragility of existing counterfactual explanation methods, which often become invalid under minor model perturbations. Existing robust methods are typically restricted to specific model architectures, rely on costly hyperparameter tuning, or offer only inflexible control over robustness. To overcome these limitations, the authors jointly model the data distribution and the space of plausible model decisions through an ensemble of models, training a conditional normalizing flow on the probabilistic consensus among the ensemble members. The framework exposes a single tunable robustness parameter that sets the minimum proportion of ensemble members that must agree on the target class, and changing it does not require retraining the generative model. Experimental results show that the method improves the empirical robustness of generated counterfactuals under model changes while maintaining good performance on other evaluation measures.
📝 Abstract
Counterfactual explanations (CFEs) are essential for interpreting black-box models, yet they often become invalid when the model changes slightly. Existing methods for generating robust CFEs are often limited to specific types of models, require costly tuning, or offer only inflexible robustness controls. We propose a novel approach that jointly models the data distribution and the space of plausible model decisions to ensure robustness to model changes. Using a probabilistic consensus over a model ensemble, we train a conditional normalizing flow that captures the data density under varying levels of classifier agreement. At inference time, a single interpretable parameter controls the robustness level: it specifies the minimum fraction of models that should agree on the target class, and it can be changed without retraining the generative model. Our method effectively pushes CFEs toward regions that are both plausible and stable across model changes. Experimental results demonstrate that our approach achieves superior empirical robustness while also maintaining good performance across other evaluation measures.
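To make the role of the robustness parameter concrete, the following is a minimal sketch of the agreement criterion only, not of the authors' conditional normalizing flow or training procedure. The function names, the parameter name `beta`, and the hard-vote representation of ensemble predictions are all assumptions introduced here for illustration; the paper describes a probabilistic consensus, which this simple vote count only approximates.

```python
from typing import List, Sequence


def agreement_fraction(votes: Sequence[Sequence[int]], target_class: int) -> List[float]:
    """Return, for each candidate, the fraction of models predicting target_class.

    votes[m][c] is the (hypothetical) hard label that ensemble member m
    assigns to candidate counterfactual c.
    """
    n_models = len(votes)
    n_candidates = len(votes[0])
    return [
        sum(1 for m in range(n_models) if votes[m][c] == target_class) / n_models
        for c in range(n_candidates)
    ]


def meets_robustness_level(
    votes: Sequence[Sequence[int]], target_class: int, beta: float
) -> List[bool]:
    """Check whether each candidate reaches the minimum agreement fraction beta.

    beta is an illustrative name for the robustness parameter: the minimum
    fraction of ensemble members that must agree on the target class.
    """
    return [frac >= beta for frac in agreement_fraction(votes, target_class)]


# Four hypothetical ensemble members (rows) vote on three candidates (columns).
votes = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]

fractions = agreement_fraction(votes, target_class=1)
accepted = meets_robustness_level(votes, target_class=1, beta=0.75)
print(fractions)  # [1.0, 0.75, 0.25]
print(accepted)   # [True, True, False]
```

In this toy setting, raising `beta` from 0.75 to 1.0 would keep only the first candidate, which illustrates how a single threshold can tighten or relax the required consensus without touching the ensemble or any generative model.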