AI Summary
This work investigates the quantification of superposition in neural networks and its impact on adversarial robustness. Method: We formalize superposition as a lossy compression process and develop an information-theoretic framework grounded in sparse autoencoders and Shannon entropy, defining the "effective feature count": the minimal number of virtual neurons required for interference-free encoding. Contribution/Results: We provide the first computationally tractable formalization of superposition as compression, revealing its biphasic dependence on task complexity and model capacity (abundance vs. scarcity regimes), thereby revising the conventional view that superposition inherently degrades robustness. Experiments on Pythia-70M and synthetic models demonstrate high-fidelity recovery of ground-truth superposition, detect superposition deficits in algorithmic tasks, and reveal systematic reduction of effective features under dropout as well as abrupt feature crystallization during grokking. Crucially, we show that adversarial training can increase the effective feature count while improving robustness.
Abstract
Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation's effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many "virtual neurons" the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in algorithmic tasks, and reveals systematic reduction under dropout. Layer-wise patterns mirror intrinsic dimensionality studies on Pythia-70M. The metric also captures developmental dynamics, detecting sharp feature consolidation during grokking. Surprisingly, adversarial training can increase effective features while improving robustness, contradicting the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity: simple tasks with ample capacity allow feature expansion (abundance regime), while complex tasks or limited capacity force reduction (scarcity regime). By defining superposition as lossy compression, this work enables principled measurement of how neural networks organize information under computational constraints, connecting superposition to adversarial robustness.
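The abstract describes computing the effective feature count by applying Shannon entropy to sparse autoencoder activations. A minimal sketch of one plausible reading of that quantity, as the exponential of the entropy (i.e. the perplexity) of the activation-mass distribution over SAE features: the function name, the normalization by mean activation mass, and the use of nats are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def effective_feature_count(sae_activations: np.ndarray) -> float:
    """Hypothetical sketch: exp(Shannon entropy) of the distribution of
    activation mass over SAE features.

    sae_activations: (n_samples, n_features) nonnegative SAE codes.
    """
    # Average activation mass per feature across the dataset (assumed normalization)
    mass = np.abs(sae_activations).mean(axis=0)
    p = mass / mass.sum()            # probability distribution over features
    p = p[p > 0]                     # drop inactive features (0 * log 0 := 0)
    entropy = -(p * np.log(p)).sum() # Shannon entropy in nats
    return float(np.exp(entropy))    # "effective number" of features

# Sanity check: mass spread uniformly over 8 features gives an
# effective count of ~8; concentrating mass on fewer features lowers it.
uniform = np.eye(8)                  # each sample activates one of 8 features equally
print(effective_feature_count(uniform))
```

Under this reading, a layer whose effective feature count exceeds its neuron count is simulating "virtual neurons" through superposition, and must accept interference as the price of that compression.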