Mechanistic Permutability: Match Features Across Layers

📅 2024-10-10
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Cross-layer feature alignment in deep neural networks is hindered by polysemanticity and feature superposition, leaving existing sparse autoencoders (SAEs) inadequate for systematic, interpretable inter-layer feature matching. Method: We propose SAE Match, a data-free method for aligning SAE features across layers. It folds activation thresholds into the encoder and decoder weights to account for layer-specific feature scales, then matches features by minimizing the mean squared error between the folded SAE parameters. Contribution/Results: Evaluated on the Gemma 2 language model, SAE Match improves cross-layer feature matching quality, reveals interpretable features that persist over several layers, and approximates hidden states across layers with high accuracy. By eliminating reliance on activation data, it provides a principled tool for studying feature evolution in mechanistic interpretability of large language models.

📝 Abstract
Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.
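The matching procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes JumpReLU-style SAEs with per-feature thresholds `theta`, folds the thresholds into the decoder rows to put features from different layers on a common scale, and resolves the one-to-one correspondence with the Hungarian algorithm (the paper only specifies minimizing MSE between folded parameters; the assignment strategy here is an assumption). `fold_thresholds` and `match_features` are hypothetical helper names.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fold_thresholds(W_dec, theta):
    """Fold per-feature activation thresholds into the decoder rows so
    features from different layers are compared on a common scale.
    (Sketch: the paper folds thresholds into both encoder and decoder
    weights; only the decoder side is shown here.)"""
    # W_dec: (num_features, d_model), theta: (num_features,)
    return W_dec * theta[:, None]

def match_features(W_dec_a, theta_a, W_dec_b, theta_b):
    """Match layer A's SAE features to layer B's by minimizing the MSE
    between folded decoder rows, as a one-to-one assignment."""
    A = fold_thresholds(W_dec_a, theta_a)
    B = fold_thresholds(W_dec_b, theta_b)
    # cost[i, j] = squared distance between folded feature i (A) and j (B)
    cost = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return cols, cost[rows, cols].mean()
```

As a sanity check, matching a decoder against a permuted copy of itself should recover the permutation exactly, with zero matching error.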
Problem

Research questions and friction points this paper is trying to address.

Aligning interpretable features across neural network layers.
Addressing polysemanticity and feature superposition in deep networks.
Improving feature matching quality using data-free methods.
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAE Match: a data-free method for aligning SAE features across layers.
Matches features by minimizing mean squared error between folded SAE parameters.
Folds activation thresholds into the weights to account for feature-scale differences.
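The hidden-state approximation result mentioned above can also be sketched: encode a layer-l residual with layer l's SAE, reorder the feature activations according to the learned matching, and decode with layer l+1's decoder. All parameter shapes and the permutation convention here are assumptions for illustration; `approximate_next_hidden` is a hypothetical helper, not the paper's code.

```python
import numpy as np

def approximate_next_hidden(x, enc_a, dec_b, perm):
    """Sketch of cross-layer hidden-state approximation. Assumes:
    enc_a = (W_enc, b_enc, theta) from layer l's JumpReLU SAE,
    dec_b = (W_dec, b_dec) from layer l+1's SAE, and perm such that
    layer-(l+1) feature j corresponds to layer-l feature perm[j]."""
    W_enc, b_enc, theta = enc_a
    W_dec, b_dec = dec_b
    pre = x @ W_enc + b_enc
    f = np.where(pre > theta, pre, 0.0)  # JumpReLU: pass only above threshold
    f_matched = f[..., perm]             # reorder features to layer l+1's basis
    return f_matched @ W_dec + b_dec     # decode as a layer-(l+1) hidden state
```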