Activation Matching for Explanation Generation

📅 2025-09-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of generating minimal, high-fidelity causal explanations for pre-trained image classifiers. The proposed method trains a lightweight autoencoder to produce binary masks that retain only the image regions critical to the classifier’s decision. Its core innovation lies in a joint optimization framework integrating multi-layer activation matching and abductive constraints: KL divergence aligns intermediate-layer activation distributions between the original and masked inputs, while label preservation loss, L1 sparsity regularization, total variation smoothing, and binary cross-entropy enforce semantic compactness and fidelity. Extensive evaluation across multiple models and datasets demonstrates that the approach effectively suppresses irrelevant background, precisely localizes discriminative regions, and achieves an optimal trade-off among explanation fidelity, minimality, and interpretability. The resulting visual attributions constitute an efficient, verifiable post-hoc explanation mechanism for black-box vision models.

πŸ“ Abstract
In this paper we introduce an activation-matching-based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image x and a frozen model f, we train a lightweight autoencoder to output a binary mask m such that the explanation e = m ⊙ x preserves both the model's prediction and the intermediate activations of x. Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors: L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision-making of the underlying model.
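The combined objective described in the abstract can be sketched numerically as follows. This is a minimal illustration, not the paper's implementation: the loss weights (w_kl, w_ce, ...), the softmax normalization of intermediate activations, and the exact form of each prior are assumptions for the sketch.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q, eps=1e-8):
    # KL(p || q) between two categorical distributions
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def explanation_loss(acts_x, acts_e, logits_x, logits_e, mask,
                     w_kl=1.0, w_ce=1.0, w_area=0.01, w_bin=0.1, w_tv=0.01):
    # (i) multi-layer activation matching: KL between softmax-normalized
    # activations of the original image x and the explanation e = m * x
    l_kl = sum(kl_div(softmax(a.ravel()), softmax(b.ravel()))
               for a, b in zip(acts_x, acts_e))
    # label preservation: cross-entropy of the explanation's logits
    # against the top-1 label predicted for x
    y = int(np.argmax(logits_x))
    l_ce = -np.log(softmax(logits_e)[y] + 1e-8)
    # (ii) mask priors on a 2-D mask in [0, 1]
    l_area = float(np.abs(mask).mean())           # L1 area -> minimality
    l_bin = float((mask * (1.0 - mask)).mean())   # penalty -> crisp 0/1 masks
    l_tv = float(np.abs(np.diff(mask, axis=0)).sum()
                 + np.abs(np.diff(mask, axis=1)).sum()) / mask.size  # compactness
    return (w_kl * l_kl + w_ce * l_ce
            + w_area * l_area + w_bin * l_bin + w_tv * l_tv)
```

In training, this scalar would be minimized over the autoencoder's parameters (with the classifier f frozen), so that the mask keeps only the regions needed to reproduce both the activations and the top-1 label.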
Problem

Research questions and friction points this paper is trying to address.

Generates minimal explanations for classifier decisions
Preserves model predictions and intermediate activations
Produces human-interpretable masks by discarding irrelevant regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation matching aligns model distributions for explanations
Lightweight autoencoder generates binary masks for input images
Mask priors ensure minimal, compact, and faithful explanations