🤖 AI Summary
This work investigates the theoretical mechanisms underlying the empirical success of Transformers across diverse tasks, focusing on their ability to learn from specific teacher models. By analyzing a simplified one-layer Transformer with “position-only” attention trained via gradient descent, the study leverages a shared bilinear structure to provide the first unified theoretical guarantees for learning classical teacher models, including convolutional layers (with average pooling), graph convolutional layers, and sparse linear predictors. Under mild assumptions, the analysis establishes that the student Transformer exactly recovers the teacher's parameter blocks, achieves the optimal population loss, and generalizes well to out-of-distribution data. These results identify a bilinear structure common to all of these tasks, offering a principled explanation for the versatility and effectiveness of Transformer-based models.
📝 Abstract
Transformers have achieved great success across a wide range of applications, yet the theoretical foundations underlying this success remain largely unexplored. To demystify the strong capabilities of transformers across versatile scenarios and tasks, we theoretically investigate using transformers as students to learn from a class of teacher models. Specifically, the teacher models covered by our analysis include convolution layers with average pooling, graph convolution layers, and classic statistical learning models such as a variant of sparse token selection models [Sanford et al., 2023, Wang et al., 2024] and group-sparse linear predictors [Zhang et al., 2025]. When learning from this class of teacher models, we prove that one-layer transformers with simplified "position-only" attention can successfully recover all parameter blocks of the teacher models and thus achieve the optimal population loss. Building on the trained transformers' efficient mimicry of the teacher models, we further show that they generalize well to a broad class of out-of-distribution data under mild assumptions. The key to our analysis is a fundamental bilinear structure shared by the various learning tasks, which enables us to establish unified learning guarantees when treating these tasks as teachers for transformers.
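To make the "position-only" attention concrete, here is a minimal NumPy sketch of such a layer. This is an illustration under our own assumptions, not the paper's exact construction: the attention matrix `A` is a learnable parameter indexed purely by token positions (it never sees the token content), and `V` is a value/output projection. Note that when `A` is constant across positions, the layer reduces to average pooling over tokens, one of the teacher models the abstract mentions.

```python
import numpy as np

def position_only_attention(X, A, V):
    """One-layer attention sketch where weights depend only on positions.

    X : (n, d) matrix of n input tokens with d features each.
    A : (n, n) learnable attention logits, a function of positions only
        (independent of the token content in X).
    V : (d, d_out) value/output projection.
    Returns an (n, d_out) matrix of output tokens.
    """
    # Row-wise softmax over the position-only logits.
    scores = np.exp(A - A.max(axis=1, keepdims=True))
    attn = scores / scores.sum(axis=1, keepdims=True)
    # Bilinear form in the parameters: (softmax(A) @ X) @ V.
    return attn @ X @ V

# Toy usage with random inputs and parameters (names are illustrative).
rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.normal(size=(n, d))
A = rng.normal(size=(n, n))  # input-independent, hence "position-only"
V = rng.normal(size=(d, d))
out = position_only_attention(X, A, V)
print(out.shape)  # (4, 3)
```

The output `attn @ X @ V` is bilinear in the parameter blocks `softmax(A)` and `V`, which is the kind of bilinear structure the abstract credits for the unified analysis.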