🤖 AI Summary
This work investigates whether gradient descent can converge on its own to the equal-weight superposition state required for graph reachability reasoning, despite the saddle points induced by permutation symmetry. By separating the roles of architecture and supervision, we establish, for the first time, the existence of a manifold of global optima that contains the equal-weight superposition, which we identify as a Möbius attractor. To counter vanishing gradients in end-to-end training, we introduce a cascaded supervision mechanism. Our analysis draws on layer-wise dynamics under Sₙ symmetry, the theory of one-dimensional Möbius maps, and gradient persistence across depth. Experiments on Erdős–Rényi graphs show that at depth D=3, cascaded supervision reaches a final-step cosine similarity of 0.69, well above the 0.37 attained by standard end-to-end training, supporting the claim that superposition reasoning can emerge from the combination of Möbius attractors and cascaded supervision.
📝 Abstract
Superposition allows Transformers to reason in depth, carrying an entire reasoning frontier in parallel through a bounded-depth forward pass instead of unrolling serial chain-of-thought tokens. While Zhu et al. (2025) hand-crafted an equal-weight breadth-first frontier in a single residual stream for graph reachability, it remained open whether gradient descent could ever find this target amidst permutation-symmetric saddles.
We close this gap on Reachability-by-Superposition over Erdős–Rényi graphs by isolating the contributions of architecture and of supervision. Architecturally, we identify a Möbius attractor: under $S_n$-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state.
On the supervision side, we identify Cascade Supervision: a loss class whose backward pass simultaneously delivers (A) selectivity bootstrap, (B) gradient persistence across depth, and (C) per-step discrimination (e.g., $\mathcal{L}_{sup}$ and $\mathcal{L}_{node}$). End-to-end supervision fails condition (B) and is provably insufficient: internal gradients at layer $c$ decay as $(np)^{-(D-c-2)/2}$ in the graph fan-out $np$ and stall before the manifold is reached.
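To make the scaling concrete, the following minimal sketch evaluates the factor $(np)^{-(D-c-2)/2}$ quoted above. The function name, the layer indexing convention, and the values of `n` and `p` are illustrative assumptions, not taken from the paper, and the sketch covers only the scaling factor, not the full gradient.

```python
def internal_gradient_decay(n: int, p: float, depth: int, layer: int) -> float:
    """Return the factor (n*p) ** (-(depth - layer - 2) / 2).

    n, p  : Erdős–Rényi parameters, so n*p is the expected fan-out.
    depth : reasoning depth D.
    layer : internal layer index c (the indexing convention is an assumption).
    """
    fan_out = n * p
    if fan_out <= 0:
        raise ValueError("expected fan-out n*p must be positive")
    return fan_out ** (-(depth - layer - 2) / 2)


# Hypothetical setting: n=16, p=0.25 gives fan-out 4; depth D=3 as in the abstract.
factors = {c: internal_gradient_decay(16, 0.25, 3, c) for c in (0, 1)}
```

Under these assumed values the factor shrinks as the layer index moves away from the output, and a larger fan-out or a larger depth makes the attenuation at early layers stronger.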
Our thesis: Möbius attractor + Cascade Supervision = emergence of superposition reasoning. The parameter-free decay law predicts a final-step cosine of 0.35 vs. 0.71 (end-to-end vs. cascade) at depth D=3; experiments confirm 0.37 vs. 0.69, matching within 0.02 at every step.
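As a rough illustration of the cosine metric reported above, the sketch below compares a state vector with an equal-weight superposition over a frontier. It assumes states are written as weight vectors over the `n` graph nodes; the paper may measure the cosine in embedding space instead, and the helper names and example values are hypothetical.

```python
import math


def equal_weight_superposition(frontier, n):
    """Unit vector over n nodes with equal weight on every frontier node."""
    if not frontier:
        raise ValueError("frontier must contain at least one node")
    state = [0.0] * n
    weight = 1.0 / math.sqrt(len(frontier))
    for node in frontier:
        state[node] = weight
    return state


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical example: a 4-node frontier in a 6-node graph, compared with
# a state that has collapsed onto a single frontier node.
target = equal_weight_superposition({0, 1, 2, 3}, n=6)
collapsed = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
score = cosine_similarity(collapsed, target)
```

A state that spreads weight evenly over the frontier scores 1, while one that concentrates on a subset of frontier nodes scores lower, which is the sense in which a higher final-step cosine indicates a state closer to the equal-weight target.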