🤖 AI Summary
To address the poor length generalization of softmax self-attention in long sequences and its dependence on explicit positional encodings, this paper proposes a novel attention mechanism grounded in the stick-breaking process. The mechanism inherently models sequential order and recency bias, enabling length-adaptive generalization without positional embeddings. This work constitutes the first application of the probabilistic stick-breaking process to attention modeling. The authors further design a numerically stable algorithm and a Flash Attention-compatible kernel to support efficient training and inference. Experiments demonstrate strong zero-shot length generalization: models trained on sequences of length $2^{11}$ generalize effectively to inference lengths up to $2^{14}$ with improved perplexity. Downstream task performance matches or exceeds that of RoPE-enhanced softmax baselines, addressing fundamental limitations of conventional softmax attention in long-range modeling.
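The "numerically stable algorithm" mentioned above can be illustrated in log space: since each weight is a product of a sigmoid and many $(1-\mathrm{sigmoid})$ factors, computing it as a sum of log-sigmoids avoids underflow. The sketch below is illustrative only (it assumes $\beta_j = \sigma(z_j)$ for per-token logits $z_j$ of a single query, with token $n{-}1$ the most recent); it is not the paper's Flash Attention kernel.

```python
import numpy as np

def log_sigmoid(x):
    # Stable log(sigmoid(x)) = min(x, 0) - log1p(exp(-|x|)); never overflows.
    return np.minimum(x, 0.0) - np.log1p(np.exp(-np.abs(x)))

def stick_breaking_weights_stable(logits):
    """Stick-breaking weights for one query position, computed in log space.

    logits[j] scores token j (j = 0..n-1, token n-1 is the most recent).
    The stick is broken from the most recent token backward, so
        A[j] = sigmoid(z[j]) * prod_{k > j} (1 - sigmoid(z[k])),
    evaluated stably as exp(logsigmoid(z[j]) + sum_{k > j} logsigmoid(-z[k])).
    """
    z = np.asarray(logits, dtype=np.float64)
    log_beta = log_sigmoid(z)        # log beta_j
    log_one_minus = log_sigmoid(-z)  # log (1 - beta_j)
    # tail[j] = sum of log(1 - beta_k) over more-recent tokens k > j
    suffix = np.cumsum(log_one_minus[::-1])[::-1]
    tail = np.concatenate([suffix[1:], [0.0]])
    return np.exp(log_beta + tail)
```

Because every term stays in log space until the final `exp`, extreme logits (e.g. $\pm 10^4$) produce weights of exactly 0 or 1 instead of `inf`/`nan`.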
📝 Abstract
The self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE or position biases to account for token order. But current methods using these still face length generalisation challenges. We propose an alternative attention mechanism based on the stick-breaking process: for each token before the current one, we determine a break point $\beta_{i,j}$, which represents the proportion of the remaining stick to allocate to the current token. We repeat the process until the stick is fully allocated, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing (Shen et al., 2017). We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. We then discuss a numerically stable implementation of stick-breaking attention and adapt Flash Attention to accommodate this mechanism. Used as a drop-in replacement for current softmax+RoPE attention systems, stick-breaking attention performs competitively with current methods on downstream tasks. In particular, it performs well at length generalisation, allowing a model trained with a $2^{11}$ context window to perform well at $2^{14}$ with perplexity improvements.
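The stick-breaking allocation described in the abstract can be sketched directly: starting from the most recent token and walking backward, each token takes a fraction $\beta_{i,j}$ of whatever stick remains. This is a minimal NumPy sketch assuming $\beta_{i,j} = \sigma(z_{i,j})$ for raw attention logits $z$; the names and the dense double loop are illustrative, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stick_breaking_attention_weights(logits):
    """Turn causal attention logits z[i, j] (used for j < i) into
    stick-breaking weights:
        A[i, j] = beta[i, j] * prod_{k = j+1}^{i-1} (1 - beta[i, k]),
    with beta[i, j] = sigmoid(z[i, j]). For each query i, the stick is
    broken from the most recent previous token backward, so nearer
    tokens claim their share of the stick first (recency bias).
    """
    T = logits.shape[0]
    beta = sigmoid(logits)
    A = np.zeros((T, T))
    for i in range(T):
        remaining = 1.0  # length of the stick still unallocated
        for j in range(i - 1, -1, -1):  # most recent token first
            A[i, j] = beta[i, j] * remaining
            remaining *= 1.0 - beta[i, j]
    return A
```

Note that, unlike softmax rows, each row sums to $1 - \prod_{j<i}(1 - \beta_{i,j})$, i.e. at most 1: the stick is only fully allocated when some break probability saturates, so attention can "abstain" on short prefixes.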