🤖 AI Summary
This work addresses mode collapse in mean-field Transformers during long inference (i.e., many layers), where token distributions degenerate into point masses. To counteract this degeneration in self-attention, the authors introduce auxiliary variables, specifically positional encodings and fixed prompts, as a regularization mechanism. Using mean-field theory, probabilistic inference, and pushforward measure analysis, they show that such auxiliary variables prevent the energy-maximizing distribution from collapsing into a Dirac measure, while retaining universal representational capacity for a broad class of distributions in the limiting regime. Theoretical analysis and numerical experiments support the approach and further show that positional encodings induce metastable dynamics that sustain distributional diversity.
📝 Abstract
We use a mean-field transformer model to theoretically investigate how auxiliary variables, such as positional encoding, prevent mode collapse in self-attention mechanisms. Mean-field transformers have attracted significant attention in recent years as a tool for analyzing self-attention, because they allow token interactions to be analyzed comprehensively. However, analysis of this simple model suggests that mode collapse, where token distributions degenerate to a single point, occurs during long inference (i.e., many layers), which is a discrepancy with what is observed in practice. This study investigates the mean-field transformer model and demonstrates that introducing auxiliary variables, such as positional encoding, acts as a counterforce against this theoretical mode collapse. Specifically, we show that, in the theoretical scheme, the energy-maximizing distribution does not degenerate to a single point; instead, it is characterized by a pushforward of the auxiliary variable distribution, thereby avoiding concentration on a Dirac measure. Our main examples are positional encoding and fixed prompt insertion, which we treat as parallel auxiliary-variable mechanisms. Furthermore, we demonstrate that positional encoding and prompt insertion are universal representations in the limit, meaning that the limit distribution of inference can exactly represent a wide class of distributions. We also analyze several key properties of positional encoding and metastability, and validate our theoretical results through numerical experiments.
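The collapse described above can be illustrated with a small simulation. The following is a minimal sketch, assuming one common simplified form of self-attention dynamics on the unit sphere (softmax-weighted averaging followed by renormalization); it is not necessarily the exact model analyzed in the paper, and it does not include the auxiliary variables. The function names, the inverse temperature `beta`, the step size `dt`, and the hemisphere initialization are all illustrative assumptions.

```python
import numpy as np


def attention_step(x, beta=1.0, dt=0.1):
    """One explicit Euler step of simplified self-attention dynamics (assumed form)."""
    logits = beta * (x @ x.T)                      # pairwise scores <x_i, x_j>
    logits -= logits.max(axis=1, keepdims=True)    # stabilize the softmax numerically
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)              # row-wise softmax weights
    v = w @ x                                      # attention-weighted mean for each token
    v -= np.sum(v * x, axis=1, keepdims=True) * x  # project onto the tangent space at x_i
    x = x + dt * v
    return x / np.linalg.norm(x, axis=1, keepdims=True)  # renormalize onto the sphere


def min_cosine(x):
    """Smallest pairwise cosine similarity; a value near 1 means one tight cluster."""
    return float((x @ x.T).min())


def simulate(n_tokens=32, dim=3, n_layers=2000, seed=0):
    """Run many 'layers' and return the minimum cosine before and after."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_tokens, dim))
    x[:, 0] = np.abs(x[:, 0])                      # assumption: start inside one hemisphere
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    start = min_cosine(x)
    for _ in range(n_layers):
        x = attention_step(x)
    return start, min_cosine(x)


if __name__ == "__main__":
    start, end = simulate()
    print(f"min pairwise cosine: start={start:.3f}, end={end:.6f}")
```

Under these assumptions, the minimum pairwise cosine should rise toward 1 as the number of layers grows, meaning that all tokens merge into a single cluster. This is the point-mass degeneration that the auxiliary variables are meant to counteract.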