🤖 AI Summary
This work challenges the default assumption in interpretability research that sparse autoencoder (SAE)-learned features are inherently superior to the key-value memory–based representations intrinsically encoded in Transformer feed-forward (FF) layers. We systematically analyze FF parameters through a key-value memory lens to uncover their semantic structure and conduct a multi-dimensional comparison with SAEs across three axes: feature quality, faithfulness, and interpretability—using modern interpretability evaluation benchmarks. Results show that FF features match or exceed SAEs on most metrics, particularly in certain downstream tasks; the features learned by the two methods exhibit substantial semantic divergence; and FF parameters constitute a strong, model-intrinsic baseline for interpretability studies. This is the first empirical demonstration of intrinsic interpretability within FF layers, advocating a methodological shift toward grounding interpretability analysis directly in model architecture rather than external probes.
📝 Abstract
Recent interpretability work on large language models (LLMs) has been increasingly dominated by a feature-discovery paradigm that relies on auxiliary proxy modules: features are first learned by, e.g., sparse autoencoders (SAEs), and their quality is then evaluated. This paradigm naturally raises a critical question: do such learned features have better properties than those already represented within the original model parameters? Unfortunately, only a few studies have made such comparisons systematically so far. In this work, we revisit the interpretability of feature vectors stored in feed-forward (FF) layers, adopting the perspective of FF layers as key-value memories, using modern interpretability benchmarks. Our extensive evaluation reveals that SAEs and FFs exhibit a similar range of interpretability, although SAEs show an observable but minimal improvement in some aspects. Surprisingly, in certain aspects, even vanilla FFs yield better interpretability than SAEs, and the features discovered by SAEs and FFs diverge. These findings raise questions about the advantages of SAEs, in terms of both feature quality and faithfulness, over directly interpreting FF feature vectors, and suggest that FF key-value parameters serve as a strong baseline in modern interpretability research.
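To make the key-value memory perspective concrete, here is a minimal toy sketch (not the paper's implementation; all dimensions and matrices are hypothetical): each FF neuron pairs a key vector, which the input activates, with a value vector, which is added to the residual stream in proportion to that activation. A value vector can then be interpreted by projecting it through the unembedding matrix and reading off its top-scoring tokens.

```python
import numpy as np

# Toy dimensions for illustration only (hypothetical, not from the paper).
rng = np.random.default_rng(0)
d_model, d_ff, vocab = 8, 32, 50

K = rng.standard_normal((d_ff, d_model))     # keys: one per FF neuron
V = rng.standard_normal((d_ff, d_model))     # values: one per FF neuron
W_U = rng.standard_normal((d_model, vocab))  # unembedding matrix

def ff(x):
    """Key-value memory view of an FF layer: FF(x) = f(x K^T) V."""
    m = np.maximum(x @ K.T, 0.0)  # memory coefficients (ReLU activations)
    return m @ V                  # weighted sum of value vectors

# Interpreting one FF "feature": project its value vector into vocabulary
# space and inspect the highest-scoring entries (token ids in this toy).
v0 = V[0]
logits = v0 @ W_U
top_tokens = np.argsort(logits)[::-1][:5]
print(top_tokens)
```

This is the same mechanism under which the paper treats FF parameters as a model-intrinsic feature dictionary, in contrast to features learned post hoc by an SAE.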