ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

📅 2026-05-26

🤖 AI Summary

Sparse autoencoders trained independently at each layer face activations that are strongly coupled across depth, which leads to redundant dictionaries and hard-to-predict effects when several layers are intervened on at once. This work proposes the Residualized Sparse Autoencoder (ReSAE): it fits affine maps between selected layers and applies sparse coding only to the unexplained residuals, removing linearly predictable cross-layer structure so that each later-layer dictionary can focus on information not carried forward from earlier layers. Experiments on Pythia-1.4B and Gemma-2-9B show that ReSAE reduces decoder redundancy and improves sparse probing and targeted perturbation in most tested settings. Although ReSAEs reconstruct less of the raw activation variance, multi-layer replacement recovers more of the transformer's cross-entropy, most clearly under teacher forcing and at sufficient sparsity.
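
The residualization step described in this summary can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the paper: x_ℓ is the residual stream activation at selected layer ℓ, ℓ⁻ is the previous selected layer, and SAE_ℓ is the sparse autoencoder for layer ℓ. The least-squares objective is also an assumption; the source says only that an affine map is fitted.

```latex
\begin{aligned}
(A_\ell, b_\ell) &= \arg\min_{A,\,b}\; \mathbb{E}\,\bigl\lVert x_\ell - (A\,x_{\ell^-} + b) \bigr\rVert_2^2
  && \text{(affine fit between selected layers)} \\
r_\ell &= x_\ell - (A_\ell\,x_{\ell^-} + b_\ell)
  && \text{(unexplained residual)} \\
\hat r_\ell &= \mathrm{SAE}_\ell(r_\ell)
  && \text{(sparse code and decode of the residual only)} \\
\hat x_\ell &= A_\ell\,\tilde x_{\ell^-} + b_\ell + \hat r_\ell
  && \text{(map back to activation space)}
\end{aligned}
```

Here \tilde x_{\ell^-} stands for whatever earlier-layer activation is fed into the chain. One plausible reading is that it is the true activation under teacher forcing and the earlier reconstruction \hat x_{\ell^-} otherwise, but the summary does not spell this out.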

📝 Abstract

Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled across depth. This creates a practical problem for multi-layer interventions: different layerwise dictionaries can spend capacity representing the same carried-forward information, and replacing several layers at once can produce interactions that are not predicted by single-layer behavior. We introduce Residualized Sparse Autoencoders (ReSAEs), which fit an affine map between selected layers and train each later-layer SAE on the unexplained residual rather than on the full activation. Reconstructions are mapped back into the original activation space through the fitted affine chain, so ReSAEs can be evaluated with the same intervention protocols as ordinary SAEs. On Pythia-1.4B and Gemma-2-9B, residualization reduces decoder redundancy and improves sparse probing and targeted perturbation in most tested settings. Despite reconstructing less of the raw activation variance, ReSAEs recover more of the transformer's cross-entropy under multi-layer replacement. This gain is clearest under teacher forcing and, in the online setting, at sufficient sparsity, indicating that ReSAEs preserve the components of the activation most relevant to the model's downstream computation. These results suggest that removing linearly predictable cross-layer structure is a useful default for multi-layer SAE interventions.
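
As a rough illustration of the pipeline the abstract describes, the NumPy sketch below fits an affine map between consecutive selected layers, computes the residual a later-layer SAE would be trained on, and maps residual reconstructions back through the affine chain. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`fit_affine`, `residualize`, `reconstruct_chain`), the ordinary least-squares fit, and the synthetic activations are all hypothetical, and SAE training itself is omitted. To keep the example checkable, the exact residuals stand in for the SAE's residual reconstructions.

```python
import numpy as np

def fit_affine(x_prev, x_next):
    """Ordinary least-squares fit of x_next ≈ x_prev @ W + b (assumed fitting rule)."""
    ones = np.ones((x_prev.shape[0], 1))
    design = np.hstack([x_prev, ones])            # append a bias column
    coef, *_ = np.linalg.lstsq(design, x_next, rcond=None)
    return coef[:-1], coef[-1]                    # W: (d, d), b: (d,)

def residualize(x_prev, x_next, W, b):
    """Part of x_next that the affine map from x_prev does not explain."""
    return x_next - (x_prev @ W + b)

def reconstruct_chain(x0_hat, residual_hats, maps):
    """Map residual reconstructions back to activation space, layer by layer."""
    outs = [x0_hat]
    for r_hat, (W, b) in zip(residual_hats, maps):
        outs.append(outs[-1] @ W + b + r_hat)
    return outs

# Synthetic stand-in for residual stream activations at three selected layers:
# each layer is mostly a linear function of the previous one plus small noise.
rng = np.random.default_rng(0)
n, d = 2000, 8
acts = [rng.normal(size=(n, d))]
for _ in range(2):
    mix = np.eye(d) + 0.1 * rng.normal(size=(d, d))
    acts.append(acts[-1] @ mix + 0.05 * rng.normal(size=(n, d)))

maps, residuals = [], []
for x_prev, x_next in zip(acts[:-1], acts[1:]):
    W, b = fit_affine(x_prev, x_next)
    maps.append((W, b))
    residuals.append(residualize(x_prev, x_next, W, b))

# Fraction of each later layer's variance explained by the affine map alone.
explained = [1.0 - r.var() / x.var() for r, x in zip(residuals, acts[1:])]

# With exact residuals in place of SAE outputs, the chain is lossless.
recon = reconstruct_chain(acts[0], residuals, maps)
```

In a real setup, each entry of `residuals` would be the training data for that layer's SAE, and the SAE's reconstruction of the residual would replace the exact residual passed to `reconstruct_chain`. Whether the chain is fed true or reconstructed earlier-layer activations is a design choice this sketch leaves open.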

Problem

Research questions and friction points this paper is trying to address.

sparse autoencoders

multi-layer interventions

transformer residual stream

cross-layer redundancy

activation coupling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Residualized Sparse Autoencoders

multi-layer interventions

transformer residual stream