🤖 AI Summary
This paper investigates whether Transformers, under unsupervised pretraining, can autonomously learn spectral algorithms without task-specific supervision.
Method: We introduce an algorithm-unfolding modeling paradigm, distinct from in-context learning, that formalizes how Transformers implicitly acquire algorithmic knowledge through experience-like training. Leveraging spectral analysis and Gaussian mixture model (GMM) theory, we provide constructive theoretical proofs that multi-layer Transformers can implement principal component analysis (PCA) and GMM-based clustering without prompting or downstream fine-tuning.
Contribution/Results: We establish, for the first time, a provable correspondence between the Transformer architecture and classical spectral methods. Our analysis yields convergence guarantees, and empirical evaluation on both synthetic and real-world datasets demonstrates performance approaching statistical optimality. The work reveals an intrinsic algorithm-learning mechanism within Transformers, bridging architectural design and principled unsupervised algorithm discovery.
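The correspondence between stacked Transformer layers and iterative spectral methods can be illustrated with power iteration for PCA, where each matrix-multiply-and-normalize step plays the role the paper ascribes to one layer. This is a minimal sketch of the classical algorithm for intuition, not the paper's Transformer construction; the function name and layer analogy are our own.

```python
import numpy as np

def power_iteration_pca(X, n_layers=20, seed=0):
    """Recover the top principal component of X via power iteration.

    Each iteration (multiply by the covariance, then renormalize) is
    analogous to one layer of an unrolled multi-layer architecture.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)              # center the data
    cov = Xc.T @ Xc / len(Xc)            # sample covariance matrix
    v = rng.standard_normal(cov.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_layers):            # one "layer" per iteration
        v = cov @ v                      # amplify the top eigendirection
        v /= np.linalg.norm(v)           # keep the iterate on the unit sphere
    return v
```

With enough iterations, `v` converges (up to sign) to the leading eigenvector of the sample covariance, which is why depth in the unrolled view corresponds to estimation accuracy.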
📝 Abstract
Transformers demonstrate significant advantages as the building block of modern LLMs. In this work, we study the capacity of Transformers to perform unsupervised learning. We show that multi-layered Transformers, given a sufficiently large set of pre-training instances, can learn the algorithms themselves and perform statistical estimation on new instances. This learning paradigm is distinct from the in-context learning setup and resembles the learning procedure of the human brain, where skills are acquired through past experience. Theoretically, we prove that pre-trained Transformers can learn spectral methods, using classification under the bi-class Gaussian mixture model as an example. Our proof is constructive and relies on algorithmic design techniques. Our results build on the similarity between the multi-layered Transformer architecture and the iterative recovery algorithms used in practice. Empirically, we verify the strong capacity of the multi-layered (pre-trained) Transformer on unsupervised learning through both PCA and clustering tasks on synthetic and real-world datasets.
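The bi-class GMM example used in the abstract can be sketched with the standard spectral approach: for a symmetric two-component mixture with means ±μ, the top eigenvector of the second-moment matrix aligns with μ, so thresholding the projection onto it recovers the cluster labels. This is a plain NumPy sketch of that classical baseline under an assumed symmetric-means model, not the paper's Transformer construction.

```python
import numpy as np

def spectral_gmm_cluster(X, n_iter=30, seed=0):
    """Cluster a symmetric bi-class GMM (component means +mu and -mu).

    Estimates the mean direction as the top eigenvector of the
    second-moment matrix (via power iteration), then labels each
    point by the sign of its projection onto that direction.
    """
    rng = np.random.default_rng(seed)
    M = X.T @ X / len(X)                 # second-moment matrix
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):              # power iteration for the top eigenvector
        v = M @ v
        v /= np.linalg.norm(v)
    return np.sign(X @ v)                # cluster labels, up to a global sign flip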