Causal Interpretation of Neural Network Computations with Contribution Decomposition

📅 2026-03-06
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing approaches struggle to uncover the causal influence of hidden neurons on network outputs, because activation patterns alone are insufficient to reveal internal computational mechanisms. This work proposes CODEC, a method that, unlike prior activation-based analyses, decomposes network behavior into interpretable, sparse contribution modes by combining contribution decomposition with sparse autoencoders. Applied to benchmark image-classification networks and to models of retinal neural activity, the framework reveals cross-layer causal computation pathways, enables precise intervention on and visualization of intermediate layers, uncovers a progressive decoupling of positive and negative contributions in deeper layers, and shows how compositional interactions among intermediate neurons give rise to dynamic receptive fields.

📝 Abstract
Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on network outputs. We further show that decomposing contributions into sparse modes enables greater control and interpretation of intermediate layers, supporting both causal manipulations of network output and human-interpretable visualizations of distinct image components that combine to drive that output. Finally, by analyzing state-of-the-art models of neural activity in the vertebrate retina, we demonstrate that CODEC uncovers combinatorial actions of model interneurons and identifies the sources of dynamic receptive fields. Overall, CODEC provides a rich and interpretable framework for understanding how nonlinear computations evolve across hierarchical layers, establishing contribution modes as an informative unit of analysis for mechanistic insights into artificial neural networks.
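The abstract describes CODEC only at a high level, but the core computation it names (hidden-neuron contributions, then a sparse-autoencoder decomposition) can be sketched concretely. Below is a minimal, hypothetical reading in PyTorch: for a linear readout, a neuron's contribution to an output is its activation times its readout weight, and a sparse autoencoder is then fit to the resulting contribution vectors. Every name here (contributions, SparseAutoencoder, the L1 coefficient) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of contribution decomposition + sparse autoencoder,
# assuming a linear readout; names and hyperparameters are illustrative.
import torch
import torch.nn as nn

def contributions(hidden_acts: torch.Tensor, readout_weight: torch.Tensor) -> torch.Tensor:
    """Signed per-neuron contributions to each output unit.

    For y = W @ a, neuron j's contribution to output i is W[i, j] * a[j];
    unlike the raw activation a[j], it carries the sign and scale with
    which the neuron actually drives that output.
    Shapes: hidden_acts (batch, n_hidden), readout_weight (n_out, n_hidden)
    -> (batch, n_out, n_hidden).
    """
    return hidden_acts.unsqueeze(1) * readout_weight.unsqueeze(0)

class SparseAutoencoder(nn.Module):
    """Plain L1-penalized autoencoder over contribution vectors."""

    def __init__(self, n_hidden: int, n_modes: int):
        super().__init__()
        self.encoder = nn.Linear(n_hidden, n_modes)
        self.decoder = nn.Linear(n_modes, n_hidden)

    def forward(self, c: torch.Tensor):
        z = torch.relu(self.encoder(c))  # sparse coefficients over "modes"
        return self.decoder(z), z

# Toy usage: decompose contributions to output unit 0 into sparse modes.
torch.manual_seed(0)
acts = torch.randn(256, 64)          # hidden activations over a batch
W = torch.randn(10, 64)              # linear readout weights
c = contributions(acts, W)[:, 0, :]  # contributions to output 0

sae = SparseAutoencoder(n_hidden=64, n_modes=128)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(200):
    recon, z = sae(c)
    loss = ((recon - c) ** 2).mean() + 1e-3 * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The contribution view keeps the sign with which each neuron drives the output, which is the information the abstract argues raw activations alone lack.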
Problem

Research questions and friction points this paper is trying to address.

causal interpretation
neural network computations
contribution decomposition
mechanistic interpretability
hidden neuron contributions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contribution Decomposition
Sparse Autoencoders
Causal Interpretation
Neural Network Mechanisms
Interpretable AI
👥 Authors
Joshua Brendan Melander
Department of Neurobiology, Stanford University
Zaki Alaoui
Department of Neurobiology, Stanford University
Shenghua Liu
Institute of Computing Technology, Chinese Academy of Sciences
trustworthy foundation model, big graph mining, anomaly detection, robust learning
Surya Ganguli
Associate Professor, Stanford University
Neuroscience, Physics, Machine Learning
Stephen A. Baccus
Department of Neurobiology, Stanford University