AI Summary
This work addresses the challenge of generating minimal, high-fidelity causal explanations for pre-trained image classifiers. The proposed method trains a lightweight autoencoder to produce binary masks that retain only the image regions critical to the classifier's decision. Its core innovation lies in a joint optimization framework integrating multi-layer activation matching and abductive constraints: KL divergence aligns intermediate-layer activation distributions between the original and masked inputs, while a label-preservation loss, L1 sparsity regularization, total-variation smoothing, and a binarization penalty enforce semantic compactness and fidelity. Evaluation across multiple models and datasets demonstrates that the approach suppresses irrelevant background, localizes discriminative regions, and achieves a favorable trade-off among explanation fidelity, minimality, and interpretability. The resulting visual attributions constitute an efficient, verifiable post-hoc explanation mechanism for black-box vision models.
Abstract
In this paper we introduce an activation-matching-based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image \(x\) and a frozen model \(f\), we train a lightweight autoencoder to output a binary mask \(m\) such that the explanation \(e = m \odot x\) preserves both the model's prediction and the intermediate activations of \(x\). Our objective combines: (i) multi-layer activation matching with KL divergence to align activation distributions, plus cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors: an L1 area term for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision-making of the underlying model.
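The combined objective above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the function name `explanation_loss`, the loss weights, and the use of softmax-normalized activations for the per-layer KL term are all assumptions made for concreteness; the abductive constraints of part (iii) are omitted.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    # KL divergence KL(p || q) between two discrete distributions.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def explanation_loss(mask, acts_x, acts_e, logits_x, logits_e,
                     w_kl=1.0, w_ce=1.0, w_area=0.1, w_bin=0.1, w_tv=0.1):
    """Sketch of the combined objective for a 2-D mask in [0, 1].

    acts_x / acts_e: lists of flattened intermediate activations for the
    original image x and the explanation e = m * x, one array per layer.
    logits_x / logits_e: classifier logits for x and e.
    All weights w_* are illustrative, not values from the paper.
    """
    # (i) multi-layer activation matching: KL between activation distributions.
    l_kl = sum(kl(softmax(a), softmax(b)) for a, b in zip(acts_x, acts_e))
    # Label preservation: cross-entropy of the explanation's logits
    # against the original top-1 label.
    y = int(np.argmax(logits_x))
    l_ce = -float(np.log(softmax(logits_e)[y] + 1e-12))
    # (ii) mask priors.
    l_area = float(np.abs(mask).mean())          # L1 area -> minimality
    l_bin = float((mask * (1.0 - mask)).mean())  # push values toward {0, 1}
    l_tv = float(np.abs(np.diff(mask, axis=0)).sum()
                 + np.abs(np.diff(mask, axis=1)).sum())  # compactness
    return (w_kl * l_kl + w_ce * l_ce
            + w_area * l_area + w_bin * l_bin + w_tv * l_tv)
```

In training, this scalar would be backpropagated through the autoencoder that emits the mask (with the classifier frozen); here it only shows how the terms compose.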