Residual Stream Analysis with Multi-Layer SAEs

📅 2024-09-06
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work addresses a limitation of conventional sparse autoencoders (SAEs): because they are trained separately on each transformer layer, they are ill-suited to studying how information flows across layers. The authors introduce the multi-layer sparse autoencoder (MLSAE), a single SAE trained on residual stream activation vectors from every layer of a transformer. Their experiments yield three key findings: (1) individual latents are often active at only a single layer for a given token or prompt, but that layer can differ across tokens and prompts; (2) the variance of a latent's distribution of activations over layers is roughly two orders of magnitude greater when aggregated over tokens than for a single token; and (3) latents are active at multiple layers to a greater degree in larger underlying models, consistent with residual stream activations at adjacent layers becoming more similar. Applying pre-trained tuned-lens transformations to relax the assumption of a shared residual stream basis leaves these findings qualitatively unchanged. The code is publicly available, offering a new approach to probing how representations change as they flow through transformers.

📝 Abstract
Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, SAEs are usually trained separately on each transformer layer, making it difficult to use them to study how information flows across layers. To solve this problem, we introduce the multi-layer SAE (MLSAE): a single SAE trained on the residual stream activation vectors from every transformer layer. Given that the residual stream is understood to preserve information across layers, we expected MLSAE latents to 'switch on' at a token position and remain active at later layers. Interestingly, we find that individual latents are often active at a single layer for a given token or prompt, but the layer at which an individual latent is active may differ for different tokens or prompts. We quantify these phenomena by defining a distribution over layers and considering its variance. We find that the variance of the distributions of latent activations over layers is about two orders of magnitude greater when aggregating over tokens compared with a single token. For larger underlying models, the degree to which latents are active at multiple layers increases, which is consistent with the fact that the residual stream activation vectors at adjacent layers become more similar. Finally, we relax the assumption that the residual stream basis is the same at every layer by applying pre-trained tuned-lens transformations, but our findings remain qualitatively similar. Our results represent a new approach to understanding how representations change as they flow through transformers. We release our code to train and analyze MLSAEs at https://github.com/tim-lawson/mlsae.
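The core idea in the abstract, one SAE trained on residual stream activation vectors pooled from every layer, can be sketched in a few lines. The snippet below is my own minimal NumPy illustration, not the authors' implementation: the dimensions, the ReLU, and the top-k sparsity rule are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed, not taken from the paper).
d_model, n_latents, k = 16, 64, 4
n_layers, n_tokens = 6, 10

# Residual stream activations from every layer, pooled into one training set:
# this pooling is what makes the SAE "multi-layer".
acts = rng.normal(size=(n_layers, n_tokens, d_model)).reshape(-1, d_model)

# One shared encoder/decoder for all layers.
W_enc = rng.normal(scale=0.1, size=(d_model, n_latents))
W_dec = rng.normal(scale=0.1, size=(n_latents, d_model))
b_enc = np.zeros(n_latents)

def encode(x):
    """Sparse encoding: ReLU, then keep only the k largest latents per row."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)
    # Zero all but the top-k entries in each row.
    drop_idx = np.argsort(pre, axis=-1)[:, :-k]
    z = pre.copy()
    np.put_along_axis(z, drop_idx, 0.0, axis=-1)
    return z

z = encode(acts)          # sparse latent activations, shared across layers
recon = z @ W_dec         # reconstruction of the residual stream vectors
```

Training would then minimize the reconstruction error between `recon` and `acts`; because all layers share one latent space, a single latent's activations can be compared across layers directly.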
Problem

Research questions and friction points this paper is trying to address.

Analyze transformer layer information flow
Develop multi-layer sparse autoencoder
Quantify latent activation distribution variance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-layer SAE for transformer analysis
Training on residual stream activation vectors from every layer
Latent activation variance quantification
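The variance quantification listed above can be illustrated schematically: for each latent, treat its activation mass as a distribution over layers and compute that distribution's variance, once for a single token and once aggregated over tokens. This is my own synthetic sketch of the idea, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n_tokens = 8, 200

# Synthetic activations a[layer, token] for one latent: for each token the
# latent fires at exactly one layer, but that layer differs across tokens
# (the qualitative pattern the paper reports).
peak = rng.integers(0, n_layers, size=n_tokens)
acts = np.zeros((n_layers, n_tokens))
acts[peak, np.arange(n_tokens)] = 1.0

layers = np.arange(n_layers)

def layer_variance(weights):
    """Variance of the distribution over layers defined by the weights."""
    p = weights / weights.sum()
    mean = (layers * p).sum()
    return (((layers - mean) ** 2) * p).sum()

# Single token: all mass at one layer, so the variance is zero.
var_single = layer_variance(acts[:, 0])

# Aggregated over tokens: mass spreads across layers, so the variance is
# far larger -- the paper reports roughly a two-order-of-magnitude gap.
var_agg = layer_variance(acts.sum(axis=1))
```

With per-token activations concentrated at a single layer, `var_single` is exactly zero while `var_agg` approaches the variance of a near-uniform distribution over layers, mirroring the paper's single-token versus aggregated comparison.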
Tim Lawson
PhD student, University of Bristol
language modeling, interpretability
Lucy Farnik
School of Engineering Mathematics and Technology, University of Bristol, Bristol, UK
Conor Houghton
School of Engineering Mathematics and Technology, University of Bristol, Bristol, UK
Laurence Aitchison
University of Bristol
Deep Learning