Geometric Analysis of Token Selection in Multi-Head Attention

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limited interpretability and lack of quantitative analysis of token selection in multi-head attention. Viewing standard attention as a top-N token selection process in value space, the work proposes geometric metrics based on Precision, Recall, and F-score to reveal its intrinsic nature as a structured geometric classifier, and identifies three distinct head-level functional patterns: Retriever, Mixer, and Reset. Using non-asymptotic theory under assumptions of stable value norms, compressed sink tokens, and exponential similarity decay, the authors derive dimension- and margin-dependent theoretical bounds. Experiments on LLaMA-2-7B, Gemma-7B, and Mistral-7B validate these predictions, demonstrating that top-N selection significantly enhances token separability and that sink similarity is strongly correlated with Recall.

📝 Abstract
We present a geometric framework for analysing multi-head attention in large language models (LLMs). Without altering the mechanism, we view standard attention through a top-N selection lens and study its behaviour directly in value-state space. We define geometric metrics (Precision, Recall, and F-score) to quantify separability between selected and non-selected tokens, and derive non-asymptotic bounds with explicit dependence on dimension and margin under empirically motivated assumptions (stable value norms with a compressed sink token, exponential similarity decay, and piecewise attention weight profiles). The theory predicts a small-N operating regime of strongest non-trivial separability and clarifies how sequence length and sink similarity shape the metrics. Empirically, across LLaMA-2-7B, Gemma-7B, and Mistral-7B, measurements closely track the theoretical envelopes: top-N selection sharpens separability, and sink similarity correlates with Recall. We also find that in LLaMA-2-7B heads specialize into three regimes (Retriever, Mixer, and Reset) with distinct geometric signatures. Overall, attention behaves as a structured geometric classifier with measurable criteria for token selection, offering head-level interpretability and informing geometry-aware sparsification and design of attention in LLMs.
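To make the top-N selection view concrete, the sketch below shows one plausible way to score selected versus non-selected tokens geometrically. The paper's exact metric definitions are not given here, so this is an illustrative assumption: top-N tokens by attention weight form the positive class, every token is assigned to the nearer class centroid in value space, and Precision/Recall/F-score measure how well that geometric decision recovers the selected set.

```python
import numpy as np

def topn_geometric_metrics(V, attn, n_select):
    """Illustrative sketch, not the paper's exact metrics.

    V    : (num_tokens, d) array of value states
    attn : (num_tokens,) attention weights for one query/head
    """
    # Top-N selection: positives are the N highest-attention tokens.
    top = np.argsort(attn)[::-1][:n_select]
    selected = np.zeros(len(attn), dtype=bool)
    selected[top] = True

    # Nearest-centroid "classifier" in value space (assumed decision rule).
    mu_pos = V[selected].mean(axis=0)
    mu_neg = V[~selected].mean(axis=0)
    d_pos = np.linalg.norm(V - mu_pos, axis=1)
    d_neg = np.linalg.norm(V - mu_neg, axis=1)
    predicted = d_pos < d_neg

    # Score the geometric decision against the attention-based selection.
    tp = np.sum(predicted & selected)
    precision = tp / max(predicted.sum(), 1)
    recall = tp / max(selected.sum(), 1)
    f_score = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f_score

# Toy example: shift the top-8 tokens' values so the classes are separable.
rng = np.random.default_rng(0)
V = rng.normal(size=(64, 16))
attn = rng.random(64)
V[np.argsort(attn)[::-1][:8]] += 2.0
p, r, f = topn_geometric_metrics(V, attn, n_select=8)
```

When selected tokens occupy a well-separated region of value space (as in the toy data above), all three scores approach 1; overlapping classes drive them toward chance, which is the separability signal the framework quantifies.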
Problem

Research questions and friction points this paper is trying to address.

multi-head attention
token selection
geometric analysis
large language models
value-state space
Innovation

Methods, ideas, or system contributions that make the work stand out.

geometric analysis
multi-head attention
token selection
top-N selection
attention interpretability
Timur Mudarisov
University of Luxembourg
Mikhail Burtsev
London Institute for Mathematical Sciences
Tatiana Petrova
University of Luxembourg
Radu State
University of Luxembourg
Network Security, Network and Service Management