🤖 AI Summary
This study systematically investigates the dual impact of model pruning on neural network interpretability, quantifying fidelity, sparsity, and semantic consistency across two interpretability dimensions: low-level saliency maps and high-level concept representations. Using magnitude-based pruning and fine-tuning on a ResNet-18 trained on ImageNette, we analyze saliency via Vanilla Gradients and Integrated Gradients, and extract human-aligned concepts using CRAFT. Results show that light-to-moderate pruning (≤40% sparsity) improves saliency-map focus, enhances concept disentanglement, and strengthens alignment with human cognition. In contrast, aggressive pruning, even when it preserves predictive accuracy, induces feature entanglement, saliency distortion, and semantic degradation of learned concepts. Crucially, this work provides the first empirical evidence of "performance–interpretability decoupling": model accuracy and interpretability do not co-vary monotonically under pruning. These findings establish theoretical foundations and practical design boundaries for interpretable model compression in trustworthy AI.
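The magnitude-based pruning referred to above can be sketched in a few lines: weights whose absolute values fall in the smallest fraction are zeroed out, and the model is then fine-tuned. Below is a minimal numpy sketch of this idea on a single weight array; the function name and threshold logic are illustrative, not the study's actual implementation (which prunes a ResNet-18 inside a deep-learning framework and fine-tunes afterward).

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured magnitude pruning: zero out the fraction `sparsity`
    of entries with the smallest absolute values (illustrative sketch)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of entries to remove
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude serves as the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask
```

In practice this is applied per layer (or globally across layers) with framework utilities such as PyTorch's `torch.nn.utils.prune.l1_unstructured`, and the surviving weights are fine-tuned to recover accuracy.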
📝 Abstract
Prior work has shown that neural networks can be heavily pruned while preserving performance, but the impact of pruning on model interpretability remains unclear. In this work, we investigate how magnitude-based pruning followed by fine-tuning affects both low-level saliency maps and high-level concept representations. Using a ResNet-18 trained on ImageNette, we compare post-hoc explanations from Vanilla Gradients (VG) and Integrated Gradients (IG) across pruning levels, evaluating sparsity and faithfulness. We further apply CRAFT-based concept extraction to track changes in the semantic coherence of learned concepts. Our results show that light-to-moderate pruning improves saliency-map focus and faithfulness while retaining distinct, semantically meaningful concepts. In contrast, aggressive pruning merges heterogeneous features, reducing saliency-map sparsity and concept coherence despite maintaining accuracy. These findings suggest that while pruning can shape internal representations toward more human-aligned attention patterns, excessive pruning undermines interpretability.
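Integrated Gradients, one of the two attribution methods compared above, attributes a prediction to input features by averaging gradients along a straight-line path from a baseline to the input. A minimal numpy sketch follows; the midpoint Riemann sum, step count, and the analytic toy gradient in the test are illustrative assumptions, not the paper's setup (which computes IG on image pixels via automatic differentiation).

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=64):
    """Approximate IG_i(x) = (x_i - x'_i) * ∫_0^1 ∂F/∂x_i at x' + α(x - x') dα
    using a midpoint Riemann sum over `steps` points on the path."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints in (0, 1)
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    avg_grad = total / steps
    return (x - baseline) * avg_grad
```

A useful sanity check is the completeness axiom: the attributions should sum to F(x) − F(baseline). For F(x) = Σ x_i² with gradient 2x and a zero baseline, the attributions sum to F(x) exactly.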