🤖 AI Summary
Existing interpretability methods struggle to explain why Transformer attention heads assign high attention scores to specific tokens, and in particular offer no feature-level account of the interactions between queries and keys. This work proposes a contrastive covariance–based decomposition of the query-key (QK) space, which for the first time disentangles QK interactions into low-rank, human-interpretable subspaces. It reveals that high attention arises when queries and keys align along semantic or binding-related features within these subspaces. Through theoretical analysis and empirical validation on large language models, the method identifies interpretable subspaces corresponding to categorical semantics and binding relationships, enabling feature-level attribution of attention scores and improving the transparency of attention mechanisms.
📝 Abstract
Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the query-key (QK) space -- the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. High attention scores are produced when query and key features align within these low-rank subspaces. We first study our method both analytically and empirically in a simplified setting. We then apply our method to large language models to identify human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.
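To make the core idea concrete, here is a minimal synthetic sketch of a contrastive covariance decomposition. It is not the paper's implementation: the data, the shared feature direction `u`, and the positive/negative pair construction are all illustrative assumptions, standing in for query/key pairs that would in practice be harvested from a trained model's attention patterns.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 1000  # hypothetical head dimension and number of query/key pairs

# Synthetic stand-in: "positive" (high-attention) query/key pairs share a
# latent coefficient along one feature direction u; "negative" pairs are
# independent noise. Both u and the pair construction are assumptions made
# for illustration only.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
c = 3.0 * rng.normal(size=(n, 1))
Q_pos = rng.normal(size=(n, d)) + c * u
K_pos = rng.normal(size=(n, d)) + c * u
Q_neg = rng.normal(size=(n, d))
K_neg = rng.normal(size=(n, d))

def cross_cov(Q, K):
    """Mean-centered query-key cross-covariance, a d x d matrix."""
    Qc, Kc = Q - Q.mean(0), K - K.mean(0)
    return Qc.T @ Kc / len(Q)

# Contrastive covariance: subtract the negative-pair baseline, then take an
# SVD; the leading singular directions span a low-rank QK subspace.
C = cross_cov(Q_pos, K_pos) - cross_cov(Q_neg, K_neg)
U, S, Vt = np.linalg.svd(C)

# The top singular direction recovers the planted feature direction u
# (up to sign).
alignment = abs(U[:, 0] @ u)
print(f"top-1 alignment with planted direction: {alignment:.3f}")

# A single bilinear score q.M.k could then be attributed component-by-
# component as sum_i S[i] * (q @ U[:, i]) * (Vt[i] @ k), truncated to the
# top singular directions.
```

The contrast against negative pairs removes covariance structure shared by all token pairs, so the remaining low-rank directions are specific to what drives high attention; in this toy setting the top singular pair recovers the planted direction.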