AI Summary
This work investigates the quantification of superposition in neural networks and its impact on adversarial robustness. Method: We formalize superposition as a lossy compression process and develop an information-theoretic framework grounded in sparse autoencoders and Shannon entropy, defining the "effective feature count": the minimal number of virtual neurons required for interference-free encoding. Contribution/Results: We provide the first computationally tractable formalization of superposition as compression, revealing its biphasic dependence on task complexity and model capacity (abundance vs. scarcity regimes), thereby revising the conventional view that superposition inherently degrades robustness. Experiments on Pythia-70M and synthetic models demonstrate high-fidelity recovery of ground-truth superposition, detect superposition deficits in algorithmic tasks, and reveal systematic reduction of effective features under dropout as well as abrupt feature crystallization during grokking. Crucially, we show that adversarial training can increase the effective feature count while improving robustness.
Abstract
Neural networks achieve remarkable performance through superposition: encoding multiple features as overlapping directions in activation space rather than dedicating individual neurons to each feature. This challenges interpretability, yet we lack principled methods to measure superposition. We present an information-theoretic framework measuring a neural representation's effective degrees of freedom. We apply Shannon entropy to sparse autoencoder activations to compute the number of effective features as the minimum neurons needed for interference-free encoding. Equivalently, this measures how many "virtual neurons" the network simulates through superposition. When networks encode more effective features than actual neurons, they must accept interference as the price of compression. Our metric strongly correlates with ground truth in toy models, detects minimal superposition in algorithmic tasks, and reveals systematic reduction under dropout. Layer-wise patterns mirror intrinsic dimensionality studies on Pythia-70M. The metric also captures developmental dynamics, detecting sharp feature consolidation during grokking. Surprisingly, adversarial training can increase effective features while improving robustness, contradicting the hypothesis that superposition causes vulnerability. Instead, the effect depends on task complexity and network capacity: simple tasks with ample capacity allow feature expansion (abundance regime), while complex tasks or limited capacity force reduction (scarcity regime). By defining superposition as lossy compression, this work enables principled measurement of how neural networks organize information under computational constraints, connecting superposition to adversarial robustness.
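The abstract describes computing the effective feature count by applying Shannon entropy to sparse autoencoder activations. A minimal sketch of one plausible reading of that quantity, as the exponential of the entropy (i.e. the perplexity) of the activation-mass distribution over SAE features: the function name, the normalization by mean activation mass, and the use of nats are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def effective_feature_count(sae_activations: np.ndarray) -> float:
    """Hypothetical sketch: exp(Shannon entropy) of the distribution of
    activation mass over SAE features.

    sae_activations: (n_samples, n_features) nonnegative SAE codes.
    """
    # Average activation mass per feature across the dataset (assumed normalization)
    mass = np.abs(sae_activations).mean(axis=0)
    p = mass / mass.sum()            # probability distribution over features
    p = p[p > 0]                     # drop inactive features (0 * log 0 := 0)
    entropy = -(p * np.log(p)).sum() # Shannon entropy in nats
    return float(np.exp(entropy))    # "effective number" of features

# Sanity check: mass spread uniformly over 8 features gives an
# effective count of ~8; concentrating mass on fewer features lowers it.
uniform = np.eye(8)                  # each sample activates one of 8 features equally
print(effective_feature_count(uniform))
```

Under this reading, a layer whose effective feature count exceeds its neuron count is simulating "virtual neurons" through superposition, and must accept interference as the price of that compression.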