Problem
Research questions and friction points this paper is trying to address.
Interpret transformers as probabilistic Laplacian Eigenmaps
Show that transformers perform linear dimensionality reduction at initialization
Improve validation performance by adding a graph diffusion step to attention
Innovation
Methods, ideas, or system contributions that make the work stand out.
Interpretation of transformers as a probabilistic Laplacian Eigenmaps model
Linear dimensionality reduction at initialization
Graph diffusion step improves validation performance
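The summary does not say what form the graph diffusion step takes, so the following is only a minimal sketch of one possible reading, not the paper's implementation. It assumes the row-stochastic attention matrix A is treated as a weighted graph over tokens, with random-walk Laplacian L = I - A, and that a diffusion step updates the token features as X - step * L X. The function names, the unprojected attention scores, and the `step` parameter are all hypothetical choices made for illustration.

```python
import math


def softmax_rows(scores):
    """Row-wise softmax, so every row is a probability distribution."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_weights(x):
    """Assumed attention graph: softmax(x x^T / sqrt(d)), with no learned projections."""
    d = len(x[0])
    x_t = [list(col) for col in zip(*x)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(x, x_t)]
    return softmax_rows(scores)


def diffusion_step(x, step=0.5):
    """One explicit diffusion step x - step * L x, with L = I - A (an assumption)."""
    a = attention_weights(x)
    ax = matmul(a, x)
    # L x = x - A x, so x - step * L x = (1 - step) * x + step * A x.
    return [
        [(1.0 - step) * xi + step * axi for xi, axi in zip(x_row, ax_row)]
        for x_row, ax_row in zip(x, ax)
    ]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = diffusion_step(tokens, step=0.5)
```

Under these assumptions, `step=1.0` reduces to the ordinary attention output A X and `step=0.0` leaves the tokens unchanged, so the parameter interpolates between no mixing and full attention averaging.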