🤖 AI Summary
This work addresses the pathological attention behavior and insufficient stability observed during decoding in large language models, which stem from a lack of theoretical understanding of token space structure. The authors reformulate causal self-attention as a stochastic process within a probabilistic framework, thereby revealing, for the first time, the geometric structure of token space and its stability boundaries. They introduce the notion of "support tokens," analogous to support vectors in SVMs, to characterize stability margins. Methodologically, the approach integrates probabilistic PCA, variable transformation, and Bayesian MAP estimation, while incorporating a smooth log-barrier regularizer into the cross-entropy loss. This framework not only establishes a novel power-set-based paradigm for sequence modeling but also significantly enhances model robustness without compromising out-of-distribution predictive performance.
📝 Abstract
Self-attention is usually described as a flexible, content-adaptive way to mix a token with information from its past. We re-interpret causal self-attention transformers, the backbone of modern foundation models, within a probabilistic framework, much like how classical PCA is extended to probabilistic PCA. This re-formulation reveals a surprising and deeper structural insight: due to a change-of-variables phenomenon, a barrier constraint emerges on the self-attention parameters. The constraint induces a highly structured geometry on the token space, providing theoretical insights into the dynamics of LLM decoding, and it exposes a boundary where attention becomes ill-conditioned, leading to a margin interpretation similar to that of classical support vector machines. Just as margins give rise to support vectors, this naturally gives rise to the concept of support tokens.
Furthermore, we show that LLMs can be interpreted as a stochastic process over the power set of the token space, providing a rigorous probabilistic framework for sequence modeling. We propose a Bayesian framework and derive a MAP estimation objective that requires only a minimal modification to standard LLM training: the addition of a smooth log-barrier penalty to the usual cross-entropy loss. We demonstrate that this provides more robust models without sacrificing out-of-sample accuracy and that it is straightforward to incorporate in practice.
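The training modification described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not specify what quantity the barrier constrains, so the `margin` argument, the barrier weight `mu`, and all function names here are assumptions introduced for illustration. The sketch only shows the general shape of the objective: standard cross-entropy plus a smooth log-barrier term that grows without bound as the constrained quantity approaches its boundary.

```python
import numpy as np

def cross_entropy(logits, target):
    # Standard softmax cross-entropy for a single next-token prediction.
    z = logits - logits.max()  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def log_barrier(margin, mu=0.01, eps=1e-8):
    # Smooth log-barrier penalty: -mu * log(margin). It vanishes when the
    # margin is comfortably large and diverges as margin -> 0+, keeping the
    # parameters strictly inside the feasible region. The choice of what
    # "margin" measures is an assumption here, not taken from the paper.
    return -mu * np.log(np.maximum(margin, eps))

def map_objective(logits, target, margin, mu=0.01):
    # MAP estimation objective as described: the usual cross-entropy loss
    # plus the smooth log-barrier penalty.
    return cross_entropy(logits, target) + log_barrier(margin, mu)

# Toy usage: as the margin shrinks toward the stability boundary,
# the penalty dominates the objective.
logits = np.array([2.0, 0.0, -1.0])
safe = map_objective(logits, target=0, margin=1.0)
risky = map_objective(logits, target=0, margin=1e-4)
```

Because the barrier is smooth, the combined objective remains differentiable and can be minimized with the same optimizers used for standard LLM training, which is what makes the modification minimal in practice.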