Demystifying the Slash Pattern in Attention: The Role of RoPE

📅 2026-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates where slash-dominant attention heads in large language models come from and how well they generalize. Combining empirical analysis of open-source models, a rank characterization of query and key vectors, a frequency-domain decomposition of Rotary Position Embedding (RoPE), and gradient-dynamics theory for shallow Transformers, the work shows that the interaction between the near-rank-one structure of queries and keys and the medium- and high-frequency components of RoPE is a sufficient condition for diagonal slash attention patterns to emerge. The paper demonstrates that this phenomenon is widespread across mainstream open-source LLMs, proves its inevitability under the stated modeling assumptions, and further shows that such heads retain strong generalization performance even on out-of-distribution prompts.
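The mechanism the summary describes can be seen in one line of algebra. This is a sketch in standard RoPE notation, not the paper's exact formalism; $r_i$ and $\phi_i$ denote the magnitudes and relative phases of the query-key coordinate pairs:

```latex
% Under the near-rank-one condition, q_t \approx q and k_s \approx k for all
% positions. RoPE applies a block-diagonal rotation R(t) with pairwise angles
% t\theta_i, and rotations compose, so the pre-softmax score is
a(t,s) \;=\; \langle R(t)\,q,\; R(s)\,k \rangle
       \;=\; q^\top R(s-t)\,k
       \;=\; \sum_i r_i \cos\!\bigl((t-s)\theta_i - \phi_i\bigr),
% which depends only on the offset t - s: every sub-diagonal of the score
% matrix is constant, and attention mass concentrates at the offset \Delta
% that maximizes this one-dimensional profile -- a slash pattern.
```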

📝 Abstract
Large Language Models (LLMs) often exhibit slash attention patterns, where attention scores concentrate along the $\Delta$-th sub-diagonal for some offset $\Delta$. These patterns play a key role in passing information across tokens. But why do they emerge? In this paper, we demystify the emergence of these Slash-Dominant Heads (SDHs) from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to models and generalize to out-of-distribution prompts. To explain the intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) Queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions between medium- and high-frequency components of RoPE give rise to SDHs. Beyond empirical evidence, we theoretically show that these conditions are sufficient to ensure the emergence of SDHs by formalizing them as our modeling assumptions. Particularly, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions, and prove that models trained via gradient descent exhibit SDHs. The SDHs generalize to out-of-distribution prompts.
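The two characteristic conditions in the abstract can be reproduced in a small numerical sketch. The dimensions, frequency band, and random vectors below are illustrative assumptions, not the paper's setup: condition (1) is modeled by using a single query vector and a single key vector at every position, and condition (2) by choosing RoPE frequencies in a medium/high band.

```python
import numpy as np

def rope_rotate(x, pos, freqs):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by angle pos * freqs[i]."""
    ang = pos * freqs
    c, s = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = c * x[0::2] - s * x[1::2]
    out[1::2] = s * x[0::2] + c * x[1::2]
    return out

rng = np.random.default_rng(0)
n_pairs, T = 8, 32
# Condition (1): near-rank-one queries/keys -- the same vector at every position.
q = rng.standard_normal(2 * n_pairs)
k = rng.standard_normal(2 * n_pairs)
# Condition (2): RoPE dominated by medium/high frequencies (illustrative band).
freqs = np.linspace(1.0, 3.0, n_pairs)

scores = np.array([[rope_rotate(q, t, freqs) @ rope_rotate(k, s, freqs)
                    for s in range(T)] for t in range(T)])

# Because q and k are position-independent, each score depends only on t - s:
# every sub-diagonal of the score matrix is constant, so attention mass
# concentrates on whichever offset maximizes this one-dimensional profile.
profile = np.array([scores[delta, 0] for delta in range(T)])
delta_star = int(np.argmax(profile))
```

Applying a causal row-wise softmax to `scores` would then place most attention mass on the `delta_star`-th sub-diagonal, i.e. a slash pattern at that offset.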
Problem

Research questions and friction points this paper is trying to address.

Slash Attention
RoPE
Attention Mechanism
Large Language Models
Transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slash Attention
RoPE
Rank-One Queries/Keys
Frequency Analysis
Transformer Dynamics