Activation Matching for Explanation Generation

📅 2025-09-26
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of generating minimal, high-fidelity causal explanations for pre-trained image classifiers. The proposed method trains a lightweight autoencoder to produce binary masks that retain only the image regions critical to the classifier’s decision. Its core innovation lies in a joint optimization framework integrating multi-layer activation matching and abductive constraints: KL divergence aligns intermediate-layer activation distributions between the original and masked inputs, while label preservation loss, L1 sparsity regularization, total variation smoothing, and binary cross-entropy enforce semantic compactness and fidelity. Extensive evaluation across multiple models and datasets demonstrates that the approach effectively suppresses irrelevant background, precisely localizes discriminative regions, and achieves an optimal trade-off among explanation fidelity, minimality, and interpretability. The resulting visual attributions constitute an efficient, verifiable post-hoc explanation mechanism for black-box vision models.

πŸ“ Abstract
In this paper we introduce an activation-matching-based approach to generate minimal, faithful explanations for the decision-making of a pretrained classifier on any given image. Given an input image x and a frozen model f, we train a lightweight autoencoder to output a binary mask m such that the explanation e = m ⊙ x preserves both the model's prediction and the intermediate activations of x. Our objective combines: (i) multi-layer activation matching with KL divergence to align distributions and cross-entropy to retain the top-1 label for both the image and the explanation; (ii) mask priors: L1 area for minimality, a binarization penalty for crisp 0/1 masks, and total variation for compactness; and (iii) abductive constraints for faithfulness and necessity. Together, these objectives yield small, human-interpretable masks that retain classifier behavior while discarding irrelevant input regions, providing practical and faithful minimalist explanations for the decision-making of the underlying model.
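The combined objective described in the abstract can be sketched numerically as follows. This is a minimal illustration, not the paper's implementation: the loss weights (w_kl, w_ce, ...), the softmax normalization of intermediate activations, and the exact form of each prior are assumptions for the sketch.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1-D vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q, eps=1e-8):
    # KL(p || q) between two categorical distributions
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def explanation_loss(acts_x, acts_e, logits_x, logits_e, mask,
                     w_kl=1.0, w_ce=1.0, w_area=0.01, w_bin=0.1, w_tv=0.01):
    # (i) multi-layer activation matching: KL between softmax-normalized
    # activations of the original image x and the explanation e = m * x
    l_kl = sum(kl_div(softmax(a.ravel()), softmax(b.ravel()))
               for a, b in zip(acts_x, acts_e))
    # label preservation: cross-entropy of the explanation's logits
    # against the top-1 label predicted for x
    y = int(np.argmax(logits_x))
    l_ce = -np.log(softmax(logits_e)[y] + 1e-8)
    # (ii) mask priors on a 2-D mask in [0, 1]
    l_area = float(np.abs(mask).mean())           # L1 area -> minimality
    l_bin = float((mask * (1.0 - mask)).mean())   # penalty -> crisp 0/1 masks
    l_tv = float(np.abs(np.diff(mask, axis=0)).sum()
                 + np.abs(np.diff(mask, axis=1)).sum()) / mask.size  # compactness
    return (w_kl * l_kl + w_ce * l_ce
            + w_area * l_area + w_bin * l_bin + w_tv * l_tv)
```

In training, this scalar would be minimized over the autoencoder's parameters (with the classifier f frozen), so that the mask keeps only the regions needed to reproduce both the activations and the top-1 label.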
Problem

Research questions and friction points this paper is trying to address.

Generates minimal explanations for classifier decisions
Preserves model predictions and intermediate activations
Produces human-interpretable masks by discarding irrelevant regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Activation matching aligns model distributions for explanations
Lightweight autoencoder generates binary masks for input images
Mask priors ensure minimal, compact, and faithful explanations