🤖 AI Summary
Existing interpretability methods struggle to explain why Transformer attention heads assign high attention scores to specific tokens, and in particular offer no feature-level account of the interactions between queries and keys. This work proposes a contrastive covariance–based decomposition of the query-key (QK) space, which for the first time disentangles QK interactions into low-rank, human-interpretable subspaces. It reveals that high attention arises when queries and keys align along semantic or binding-related features within these subspaces. Through theoretical analysis and empirical validation on large language models, the method identifies interpretable subspaces corresponding to categorical semantics and binding relationships, enabling feature-level attribution of attention scores and improving the transparency of attention mechanisms.
📝 Abstract
Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the query-key (QK) space -- the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. High attention scores are produced when query and key features align within these low-rank subspaces. We first study our method both analytically and empirically in a simplified setting. We then apply our method to large language models to identify human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.
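To make the core idea concrete, here is a minimal synthetic sketch of a contrastive covariance decomposition. It is not the paper's implementation: the data, the shared feature direction `u`, and the positive/negative pair construction are all illustrative assumptions, standing in for query/key pairs that would in practice be harvested from a trained model's attention patterns.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 1000  # hypothetical head dimension and number of query/key pairs

# Synthetic stand-in: "positive" (high-attention) query/key pairs share a
# latent coefficient along one feature direction u; "negative" pairs are
# independent noise. Both u and the pair construction are assumptions made
# for illustration only.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
c = 3.0 * rng.normal(size=(n, 1))
Q_pos = rng.normal(size=(n, d)) + c * u
K_pos = rng.normal(size=(n, d)) + c * u
Q_neg = rng.normal(size=(n, d))
K_neg = rng.normal(size=(n, d))

def cross_cov(Q, K):
    """Mean-centered query-key cross-covariance, a d x d matrix."""
    Qc, Kc = Q - Q.mean(0), K - K.mean(0)
    return Qc.T @ Kc / len(Q)

# Contrastive covariance: subtract the negative-pair baseline, then take an
# SVD; the leading singular directions span a low-rank QK subspace.
C = cross_cov(Q_pos, K_pos) - cross_cov(Q_neg, K_neg)
U, S, Vt = np.linalg.svd(C)

# The top singular direction recovers the planted feature direction u
# (up to sign).
alignment = abs(U[:, 0] @ u)
print(f"top-1 alignment with planted direction: {alignment:.3f}")

# A single bilinear score q.M.k could then be attributed component-by-
# component as sum_i S[i] * (q @ U[:, i]) * (Vt[i] @ k), truncated to the
# top singular directions.
```

The contrast against negative pairs removes covariance structure shared by all token pairs, so the remaining low-rank directions are specific to what drives high attention; in this toy setting the top singular pair recovers the planted direction.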