Training Neural Networks for Modularity aids Interpretability

📅 2024-09-24
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pretrained neural networks suffer from poor modularity, i.e., limited "clusterability": they are hard to decompose into disjoint clusters that can be studied independently, which hinders interpretability. Method: The paper trains models to be more modular using an "enmeshment loss," a training objective that encourages the formation of non-interacting neuron clusters, and evaluates the resulting clusters with automated interpretability measures on CIFAR-10. Results: The trained clusters learn different, disjoint, and smaller circuits for individual CIFAR-10 labels, making the corresponding local computations easier to isolate and analyze, and offering a promising direction for interpretability through principled modularization.

📝 Abstract
An approach to improve network interpretability is via clusterability, i.e., splitting a model into disjoint clusters that can be studied independently. We find pretrained models to be highly unclusterable and thus train models to be more modular using an "enmeshment loss" function that encourages the formation of non-interacting clusters. Using automated interpretability measures, we show that our method finds clusters that learn different, disjoint, and smaller circuits for CIFAR-10 labels. Our approach provides a promising direction for making neural networks easier to interpret.
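To make the core idea concrete, here is a minimal sketch of an enmeshment-style penalty on a single layer's weight matrix, assuming fixed integer cluster labels for the layer's input and output neurons. The squared-weight penalty, the label encoding, and the function name are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def enmeshment_loss(weight, clusters):
    """Penalize connections between neurons assigned to different clusters.

    weight:   (n_out, n_in) weight matrix of one layer.
    clusters: dict with 'in' (length n_in) and 'out' (length n_out)
              integer cluster labels for input and output neurons.
    Returns the sum of squared cross-cluster weights; driving this
    toward zero encourages non-interacting clusters.
    """
    out_labels = np.asarray(clusters["out"])[:, None]  # (n_out, 1)
    in_labels = np.asarray(clusters["in"])[None, :]    # (1, n_in)
    cross = out_labels != in_labels                    # mask of cross-cluster edges
    return float(np.sum((weight * cross) ** 2))

# Toy layer: 4 inputs, 4 outputs, two clusters labeled 0 and 1.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
labels = {"in": [0, 0, 1, 1], "out": [0, 0, 1, 1]}
loss = enmeshment_loss(W, labels)
```

Added to a task loss during training, this term pushes cross-cluster weights toward zero, so each cluster's computation becomes independent of the others.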
Problem

Research questions and friction points this paper is trying to address.

Pretrained models are highly unclusterable, which limits interpretability
Decomposing networks into disjoint clusters that can be studied independently
Identifying separate circuits for individual CIFAR-10 labels
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces an enmeshment loss that encourages non-interacting clusters during training
Trains for modularity rather than relying on post-hoc clustering of pretrained models
Validates clusters with automated interpretability measures, finding smaller, disjoint circuits per label
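One way to quantify how well a trained layer separates into clusters is the fraction of squared weight that crosses cluster boundaries. This is an illustrative metric under the same per-neuron-label assumption as above, not the paper's exact clusterability measure:

```python
import numpy as np

def cross_cluster_fraction(weight, in_labels, out_labels):
    """Fraction of total squared weight on cross-cluster edges.

    Near 0: the layer decomposes into nearly independent clusters.
    Near 1: clusters interact heavily (poor clusterability).
    """
    in_l = np.asarray(in_labels)[None, :]    # (1, n_in)
    out_l = np.asarray(out_labels)[:, None]  # (n_out, 1)
    sq = weight ** 2
    total = float(np.sum(sq))
    if total == 0.0:
        return 0.0
    cross = float(np.sum(sq[out_l != in_l]))
    return cross / total
```

A modularity-regularized model should drive this fraction down layer by layer, which is the behavior the enmeshment loss is designed to produce.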