🤖 AI Summary
When do Transformers learn generalizable graph connectivity algorithms rather than brittle, degree-based heuristics?
Method: We propose a simplified, disentangled Transformer architecture and provide a theoretical analysis showing that an L-layer variant exactly decides connectivity for graphs with diameter ≤ 3^L. We further characterize the critical role of alignment between the training data distribution (e.g., maximum graph diameter) and model capacity in enabling algorithmic generalization. Using adjacency-matrix-power modeling and training dynamics analysis, we empirically validate that both standard and disentangled Transformers converge to exact algorithms when the diameters of training graphs respect the model's theoretical capacity.
Contribution/Results: This work establishes the first quantitative link among Transformer architecture, data distribution, and algorithm learning, providing both theoretical guarantees and empirical evidence for algorithmic inductive bias in large language models. It identifies precise conditions under which Transformers generalize algorithmically, advancing our understanding of their computational capabilities beyond heuristic memorization.
📝 Abstract
Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the disentangled Transformer, and prove that an $L$-layer model has the capacity to decide connectivity for graphs with diameter up to exactly $3^L$, implementing an algorithm equivalent to computing powers of the adjacency matrix. We analyze the training dynamics and show that the learned strategy hinges on whether most training instances fall within this model capacity. Within-capacity graphs (diameter $\leq 3^L$) drive the learning of a correct algorithmic solution, while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically demonstrate that restricting training data to within a model's capacity leads both standard and disentangled Transformers to learn the exact algorithm rather than the degree-based heuristic.
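The adjacency-matrix-power algorithm referenced above can be sketched in a few lines. The idea is that one "round" of Boolean matrix composition triples the reachable path length, so $L$ rounds decide connectivity for any pair of nodes at distance $\leq 3^L$. This is an illustrative sketch of the underlying reachability computation, not the paper's actual model code; the function name and NumPy implementation are our own.

```python
import numpy as np

def connected_within(adj: np.ndarray, L: int) -> np.ndarray:
    """Boolean reachability after L 'cubing' rounds.

    adj: symmetric boolean adjacency matrix of an undirected graph.
    Returns R where R[i, j] is True iff i and j are joined by a
    path of length <= 3**L (mirroring the claimed per-layer capacity
    of an L-layer disentangled Transformer).
    """
    n = adj.shape[0]
    # Include self-loops so R initially covers paths of length <= 1.
    R = adj | np.eye(n, dtype=bool)
    for _ in range(L):
        # Compose R with itself twice: three length-<=k hops give
        # length <= 3k, so the covered radius triples each round.
        Ri = R.astype(int)
        R = (Ri @ Ri @ Ri) > 0
    return R
```

For example, on a path graph 0-1-2-3-4 (diameter 4), one round ($3^1 = 3$) reaches node 3 from node 0 but not node 4, while two rounds ($3^2 = 9$) connect all pairs.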