🤖 AI Summary
This work addresses two limitations of conventional Transformers: they offer only indirect control over the variance of token representations, and they provide no explicit guidance toward an ideal embedding geometry. The authors propose Laplacian Attention, a novel mechanism that connects token variance regulation with the Neural Collapse phenomenon. The mechanism encourages embeddings of tokens from the same class to collapse toward their class mean while arranging the class means into a maximally separable geometric configuration. Laplacian Attention yields consistent performance gains across multiple vision and language benchmarks, and analyses using principal component analysis and Neural Collapse metrics confirm that it induces the intended representational geometry.
📝 Abstract
Transformers leverage attention, the residual connection, and layer normalization to control the variance of token representations. We propose to modify attention into a Laplacian mechanism that gives the model more direct control over token variance. We conjecture that this helps transformers achieve the ideal token geometry. To investigate our conjecture, we first show that incorporating the Laplacian mechanism into transformers induces consistent improvements across benchmarks in computer vision and language. Next, we study how the Laplacian mechanism impacts the geometry of token representations using four tools: 1) principal component analysis, 2) the cosine similarity metric, 3) analysis of variance, and 4) Neural Collapse metrics. Our investigation shows that the Laplacian mechanism reshapes token embeddings toward a geometry of maximal separability: tokens collapse according to their classes, and the class means exhibit Neural Collapse.
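To make the geometric diagnostics above concrete, the sketch below shows one common way such Neural Collapse metrics are computed on class-labeled embeddings. This is an illustrative implementation, not the paper's code: the function name `neural_collapse_metrics` and the exact scatter ratios are assumptions. It measures (a) within-class variability relative to between-class variability (small values indicate tokens collapsing onto their class means) and (b) the average pairwise cosine similarity of the centered class means (a maximally separable simplex configuration of C classes gives equal similarities of -1/(C-1)).

```python
import numpy as np

def neural_collapse_metrics(embeddings, labels):
    """Simple Neural Collapse diagnostics for class-labeled embeddings.

    Returns:
        nc1: within-class scatter / between-class scatter
             (smaller = tighter collapse of samples onto class means).
        cos_mean: mean pairwise cosine similarity of centered class means
             (a simplex ETF of C classes gives -1/(C-1) for every pair).
    """
    classes = np.unique(labels)
    global_mean = embeddings.mean(axis=0)
    class_means = np.stack(
        [embeddings[labels == c].mean(axis=0) for c in classes]
    )

    # Within-class scatter: mean squared distance of samples to their class mean.
    within = np.mean([
        np.sum((embeddings[labels == c] - class_means[i]) ** 2, axis=1).mean()
        for i, c in enumerate(classes)
    ])
    # Between-class scatter: mean squared distance of class means to the global mean.
    between = np.mean(np.sum((class_means - global_mean) ** 2, axis=1))
    nc1 = within / between

    # Average pairwise cosine similarity of the centered class means.
    centered = class_means - global_mean
    normed = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    cos = normed @ normed.T
    cos_mean = cos[~np.eye(len(classes), dtype=bool)].mean()
    return nc1, cos_mean
```

On synthetic data with three classes clustered tightly around directions 120 degrees apart, `nc1` is near zero and `cos_mean` is near -0.5, the simplex value for C = 3.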