🤖 AI Summary
Transformer models exhibit a pervasive early-position bias, yet the underlying mechanisms—particularly how causal masking interacts with positional encodings (e.g., RoPE, decaying masks)—remain poorly understood.
Method: We propose the first directed-graph framework for multi-layer attention analysis, modeling the cross-layer accumulation of positional influence. Combining theoretical derivation with controlled empirical reproduction, we characterize phenomena such as the "lost-in-the-middle" effect and attention sinks.
Contribution/Results: We formally establish a dynamic trade-off between the mask structure and the positional encoding, and quantitatively validate, under standard LLM configurations, a systematic inverse relationship between long-range attention decay and the cumulative importance of early positions. Our work provides a unified graph-theoretic modeling toolkit and a theoretical foundation for analyzing and mitigating positional bias in Transformers.
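The cross-layer accumulation idea can be illustrated with a minimal numerical sketch (this is an assumed toy model, not the paper's code): treat each layer's causal attention map as the row-normalized adjacency matrix of a directed graph where token i may attend to tokens j ≤ i, and multiply the maps across layers to get each token's cumulative influence.

```python
import numpy as np

# Toy model (illustrative assumption, not the paper's implementation):
# each layer attends uniformly over all causally visible positions.
n, layers = 8, 4
A = np.tril(np.ones((n, n)))
A /= A.sum(axis=1, keepdims=True)  # row-stochastic causal attention map

# Cross-layer accumulation: the influence of token j on token i after
# `layers` layers is the (i, j) entry of the product of per-layer maps.
C = np.linalg.matrix_power(A, layers)

# Even though every single layer is uniform over visible positions, the
# last token's cumulative attention concentrates on early positions.
print(np.round(C[-1], 3))
```

Each individual attention map shows no preference among visible tokens, yet the product of maps skews heavily toward position 0: deeper layers attend to representations that have themselves already aggregated early-token information, which is the compounding mechanism behind the early-position bias.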
📝 Abstract
Recent studies have revealed various manifestations of position bias in transformer architectures, from the "lost-in-the-middle" phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper introduces a novel graph-theoretic framework to analyze position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights. First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers -- coupled with the causal mask -- leads to a trade-off between long-term decay effects and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional biases in transformers, shedding light on the complex interplay of attention-mechanism components and guiding more informed architectural design.
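The trade-off between distance-based decay and cumulative early-position importance can be sketched numerically. The following is a hedged illustration with assumed parameters (an ALiBi-style exponential decay added to a uniform causal map, not the paper's exact setup): stronger per-layer decay pulls each individual attention map toward recent tokens, which in turn reduces how much cumulative mass the product of maps places on the earliest positions.

```python
import numpy as np

# Assumed toy setup: causal attention with an exponential distance
# penalty (ALiBi-style decay mask), softmax-normalized per row.
n, layers = 8, 4
i, j = np.indices((n, n))
causal = (j <= i)

def attn(decay):
    # Scores proportional to -decay * distance on visible positions;
    # masked positions get -inf, so softmax assigns them zero weight.
    scores = np.where(causal, -decay * (i - j), -np.inf)
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)

for decay in (0.0, 0.5, 2.0):
    C = np.linalg.matrix_power(attn(decay), layers)
    # Larger decay -> each layer favors recent tokens -> less cumulative
    # mass accumulates on the first token across layers.
    print(f"decay={decay}: cumulative mass on first token = {C[-1, 0]:.3f}")
```

With zero decay, the causal mask alone concentrates cumulative attention on early positions; as the decay rate grows, that concentration weakens, which is one way to read the paper's claimed inverse relationship between long-range attenuation and early-position importance.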