🤖 AI Summary
This work investigates the intrinsic ability of Transformer self-attention to discover cluster structure in unsupervised settings, focusing on automatic cluster identification when the data are generated by a Gaussian mixture model (GMM). We propose a simplified two-head attention architecture and establish a theoretical guarantee: minimizing the population risk over unlabeled data alone suffices for the attention-head parameters to converge to the true GMM mixture centers. Through a rigorous analysis of the attention-weight dynamics, we prove that these weights spontaneously align with the underlying cluster centers, revealing the principle that "attention is clustering." Our analysis operates at the population level and provides, to our knowledge, the first provable theoretical foundation for unsupervised learning in Transformers, formally bridging self-attention mechanisms and clustering theory.
📝 Abstract
Transformers have emerged as a powerful neural network architecture capable of tackling a wide range of learning tasks. In this work, we provide a theoretical analysis of their ability to automatically extract structure from data in an unsupervised setting. In particular, we demonstrate their suitability for clustering when the input data is generated from a Gaussian mixture model. To this end, we study a simplified two-head attention layer and define a population risk whose minimization with unlabeled data drives the head parameters to align with the true mixture centroids.
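The abstract's setup can be illustrated with a small numerical sketch. The paper's exact two-head model and population risk are not reproduced here; this stand-in uses a reconstruction-style empirical risk, `E‖x − softmax(β·xPᵀ)P‖²`, with an EM-style gradient step (the attention weights are frozen within each update), and the GMM centers, sample size, and step size below are all hypothetical choices for illustration. Under these assumptions, the two head parameter vectors drift toward the two mixture centers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-component GMM in 2D with centers mu_true (hypothetical choice).
mu_true = np.array([[3.0, 0.0], [-3.0, 0.0]])
n = 4000
labels = rng.integers(0, 2, size=n)          # used only to generate data
X = mu_true[labels] + rng.normal(scale=0.5, size=(n, 2))

# Two attention "heads", one parameter vector per head; initialize from
# two samples that are far apart (an unsupervised heuristic).
far = int(np.argmax(np.linalg.norm(X - X[0], axis=1)))
P = np.stack([X[0], X[far]]).copy()

beta, lr = 1.0, 0.5
for _ in range(300):
    scores = beta * X @ P.T                            # (n, 2) head scores
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                  # softmax over heads
    err = w @ P - X                                    # reconstruction error
    P -= lr * (w.T @ err) / n                          # gradient in P, w frozen

# Each head parameter should land near one true mixture center
# (up to a permutation of the heads).
dist = np.linalg.norm(P[:, None, :] - mu_true[None, :, :], axis=2)
print(dist.min(axis=1))
```

With the softmax attention acting as a soft cluster assignment, the update is essentially a soft k-means step, which is one concrete way to read the "attention is clustering" principle.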