🤖 AI Summary
This work investigates the depth-width trade-off in Transformers for graph algorithmic tasks, addressing the central question: "Can constant-depth inference be achieved with linear width?" Leveraging circuit complexity analysis, formal modeling of the attention mechanism, constructive architecture design, and rigorous formal verification, we establish the first proof that linear-width Transformers can exactly solve fundamental graph problems, including connectivity and shortest path, in constant depth. Crucially, we uncover a non-monotonic width-depth relationship: certain graph tasks provably require quadratic width to admit constant-depth solutions. Empirical evaluation confirms that our theory-guided depth-compression framework achieves zero accuracy loss while substantially accelerating inference, demonstrating both the theoretical necessity and the practical efficacy of depth reduction under linear-width constraints.
📝 Abstract
Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. For such algorithmic tasks, a key question is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width), logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly. We analyze this setting and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We support our theoretical results with empirical evaluations.
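As an informal illustration of the kind of parallel computation at play (this is not the paper's transformer construction), graph connectivity can be decided by repeatedly squaring the Boolean adjacency matrix: each squaring doubles the path length covered, so O(log n) dense matrix operations, each of the parallel-friendly form a wide layer can emulate, suffice. A minimal sketch, where `connected` is a hypothetical helper name:

```python
import numpy as np

def connected(adj: np.ndarray, s: int, t: int) -> bool:
    """Decide s-t connectivity via repeated Boolean squaring of the
    adjacency matrix (transitive closure in O(log n) squarings).
    Illustration only -- not the transformer construction from the paper."""
    n = adj.shape[0]
    # Include self-loops so squaring accumulates all paths up to length 2^k.
    reach = (adj > 0) | np.eye(n, dtype=bool)
    for _ in range(max(1, int(np.ceil(np.log2(n))))):
        # Boolean matrix product: entry (i, j) becomes True iff some
        # intermediate vertex k has reach[i, k] and reach[k, j].
        reach = (reach.astype(int) @ reach.astype(int)) > 0
    return bool(reach[s, t])

# Example: path graph 0-1-2 plus an isolated vertex 3.
A = np.zeros((4, 4), dtype=int)
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1
```

Each squaring is a single dense matrix product, which is the operation that wide, shallow architectures handle well; the paper's contribution is showing that with linear width the number of such rounds can be made constant rather than logarithmic.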