🤖 AI Summary
Convolutional neural network (CNN) channels often exhibit polysemy: a single channel encodes multiple semantic concepts, which severely undermines interpretability. To address this, we propose a weight reconstruction algorithm that achieves the first structural disentanglement of polysemous channels. Our method clusters activation patterns from preceding layers to identify heterogeneous response modes, decomposes each original channel into multiple semantically specialized sub-channels, and performs reparameterization via feature response analysis and convolutional kernel remapping. Crucially, the approach preserves the original network architecture and enables editable, mechanism-level explanations. Evaluated on an ImageNet subset, it increases per-channel semantic purity by 63% on average, substantially improving feature visualization quality, concept localization accuracy, and attribution reliability. This work establishes a novel paradigm for channel-level interpretability in CNNs.
📝 Abstract
Mechanistic interpretability is concerned with analyzing individual components of a convolutional neural network (CNN) and how they form larger circuits that represent decision mechanisms. Such analyses are difficult because CNNs frequently learn polysemantic channels, i.e., channels that encode several distinct concepts at once and are therefore hard to interpret. To address this, we propose an algorithm that disentangles a specific kind of polysemantic channel into multiple channels, each responding to a single concept. Our approach restructures the weights of a CNN, exploiting the observation that different concepts within the same channel exhibit distinct activation patterns in the previous layer. By disentangling these polysemantic features, we enhance the interpretability of CNNs and thereby improve explanatory techniques such as feature visualizations.
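The core idea, clustering previous-layer activation patterns and splitting one channel's kernel into concept-specific sub-kernels, can be sketched roughly as below. The function name, tensor shapes, the soft input-channel masking, and the k-means clustering step are illustrative assumptions for this sketch, not the paper's exact procedure.

```python
import numpy as np

def disentangle_channel(kernel, prev_acts, n_concepts=2, n_iter=20, seed=0):
    """Split one conv channel's kernel into concept-specific sub-kernels (sketch).

    kernel:    (C_in, k, k) weights of the polysemantic channel (assumed layout).
    prev_acts: (N, C_in) per-input-channel mean activations of the previous
               layer, collected on inputs that strongly activate the channel.
    Returns a list of n_concepts sub-kernels, each with kernel's shape.
    """
    rng = np.random.default_rng(seed)

    # Tiny k-means over previous-layer activation patterns: each cluster is
    # taken to correspond to one concept the channel responds to.
    centers = prev_acts[rng.choice(len(prev_acts), n_concepts, replace=False)]
    for _ in range(n_iter):
        dists = ((prev_acts[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(n_concepts):
            if (labels == j).any():
                centers[j] = prev_acts[labels == j].mean(0)

    # Kernel remapping: weight each input channel by that cluster's share of
    # the total mean activation, yielding a soft input-channel mask per
    # concept. The masks sum to ~1, so the sub-kernels sum back to the
    # original kernel (the decomposition preserves the total response).
    subs = []
    total = centers.sum(0) + 1e-8
    for j in range(n_concepts):
        profile = centers[j] / total            # (C_in,) soft mask
        subs.append(kernel * profile[:, None, None])
    return subs
```

In an actual network, each sub-kernel would become its own channel (with downstream weights duplicated accordingly), which is how the architecture-preserving reparameterization could keep the layer's overall function unchanged while making each new channel respond to a single concept.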