When Do Transformers Learn Heuristics for Graph Connectivity?

📅 2025-10-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
When do Transformers learn generalizable graph connectivity algorithms rather than brittle, degree-based heuristics? Method: We propose a disentangled Transformer architecture and provide a theoretical analysis showing that an L-layer variant exactly decides connectivity for graphs with diameter ≤ 3^L. We further characterize the critical role of alignment between the training data distribution (e.g., maximum graph diameter) and model capacity in enabling algorithmic generalization. Using adjacency-matrix-power modeling and training-dynamics analysis, we empirically validate that both standard and disentangled Transformers converge to the exact algorithm when training graphs' diameters respect the model's theoretical capacity. Contribution/Results: This work establishes the first quantitative link among Transformer architecture, data distribution, and algorithm learning, providing both theoretical guarantees and empirical evidence for algorithmic inductive bias in large language models. It identifies precise conditions under which Transformers generalize algorithmically, advancing our understanding of their computational capabilities beyond heuristic memorization.

📝 Abstract
Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the disentangled Transformer, and prove that an $L$-layer model has the capacity to solve connectivity for graphs with diameters up to exactly $3^L$, implementing an algorithm equivalent to computing powers of the adjacency matrix. We analyze the training dynamics and show that the learned strategy hinges on whether most training instances are within this model capacity. Within-capacity graphs (diameter $\leq 3^L$) drive the learning of a correct algorithmic solution, while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically demonstrate that restricting training data to within a model's capacity leads both standard and disentangled Transformers to learn the exact algorithm rather than the degree-based heuristic.
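The abstract describes the learned algorithm as equivalent to computing powers of the adjacency matrix, with each of the $L$ layers tripling the reachable radius so the model covers diameters up to $3^L$. A minimal sketch of that computation (the function name and the radius-tripling scheme are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def connected_within(adj: np.ndarray, s: int, t: int, L: int) -> bool:
    """Decide whether nodes s and t are joined by a path of length at
    most 3**L, by repeatedly composing a boolean reachability matrix.
    Each "layer" composes reachability three times, so the covered
    radius grows 1 -> 3 -> 9 -> ... -> 3**L."""
    n = adj.shape[0]
    # Reachability in <= 1 step; self-loops let shorter paths qualify.
    reach = (adj + np.eye(n, dtype=adj.dtype)) > 0
    for _ in range(L):
        step = reach
        for _ in range(2):
            # Boolean matrix product: compose reachability once more.
            step = (step.astype(int) @ reach.astype(int)) > 0
        reach = step
    return bool(reach[s, t])
```

On a path graph 0–1–2–3, a single layer ($L = 1$) already certifies that nodes 0 and 3 are connected, since their distance 3 equals $3^1$.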
Problem

Research questions and friction points this paper is trying to address.

Explains when transformers learn algorithms versus brittle heuristics
Analyzes graph connectivity learning capacity in simplified transformer architectures
Shows training data scope determines algorithm or heuristic learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simplified Transformer architecture for graph connectivity
Capacity to solve graphs with diameter up to 3^L
Training within capacity enables exact algorithm learning
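For contrast with the exact algorithm, the degree-based shortcut that beyond-capacity training induces can be caricatured as a path-blind score. This is a hypothetical illustration only; the scoring rule and threshold are assumptions, not the heuristic the paper's models actually learn:

```python
import numpy as np

def degree_heuristic(adj: np.ndarray, s: int, t: int,
                     threshold: float = 0.5) -> bool:
    """Illustrative degree-based shortcut: guess "connected" when both
    query endpoints have high degree relative to graph size. It never
    inspects actual paths, which is what makes it brittle."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    score = (deg[s] * deg[t]) / (n * n)
    return bool(score > threshold)
```

Such a rule answers correctly on dense graphs but misclassifies, e.g., the endpoints of a path graph (degree 1 each) as disconnected, matching the paper's point that degree heuristics fail where only genuine path computation succeeds.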