Emergence of Frontier Superposition: Möbius attractor and Cascade Supervision

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work investigates whether gradient descent can spontaneously converge to the equal-weight superposition state required for graph reachability reasoning, despite the saddle points induced by permutation symmetry. By separating the roles of architecture and supervision, the authors establish the existence of a global-optimum manifold containing the equal-weight superposition, which they identify as a Möbius attractor. To counter the vanishing gradients of end-to-end training, they introduce a cascaded supervision mechanism. The analysis draws on layer-wise dynamics under Sₙ symmetry, one-dimensional Möbius map theory, and gradient persistence across depth. Experiments on Erdős–Rényi graphs show that at depth D=3, cascaded supervision achieves a final-step cosine similarity of 0.69, compared with 0.37 for standard end-to-end training, supporting the claim that superposition reasoning can emerge from the combination of the Möbius attractor and cascaded supervision.
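The cosine-similarity metric quoted above can be illustrated with a minimal sketch. It assumes, beyond what the summary states, that nodes are represented by orthonormal one-hot vectors, that the graph is directed, and that the target at step c is the unit vector with equal weight on every node reachable from the source within c hops. The function names are hypothetical and are not taken from the paper.

```python
import math
import random

def erdos_renyi(n, p, seed=0):
    """Sample a directed Erdős–Rényi graph as an adjacency list (assumed setup)."""
    rng = random.Random(seed)
    return {u: [v for v in range(n) if v != u and rng.random() < p]
            for u in range(n)}

def bfs_frontier(adj, source, steps):
    """Nodes reachable from `source` in at most `steps` hops."""
    reached = {source}
    for _ in range(steps):
        reached |= {v for u in reached for v in adj[u]}
    return reached

def equal_weight_superposition(nodes, n):
    """Unit vector over n one-hot node slots with equal weight on `nodes`."""
    w = 1.0 / math.sqrt(len(nodes))
    return [w if i in nodes else 0.0 for i in range(n)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

# Usage: compare a state that never leaves the source node with the
# equal-weight target after 3 hops (illustrative values of n and p).
n = 20
adj = erdos_renyi(n, p=0.15, seed=0)
target = equal_weight_superposition(bfs_frontier(adj, 0, 3), n)
stuck_state = equal_weight_superposition({0}, n)
print(cosine(stuck_state, target))
```

Under these assumptions, a state concentrated on a single frontier node scores 1/sqrt(k) against a frontier of k nodes, so a cosine near 1 requires weight spread evenly over the whole frontier.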

📝 Abstract

Superposition allows Transformers to reason in depth, carrying an entire reasoning frontier in parallel through a bounded-depth forward pass instead of unrolling serial chain-of-thought tokens. While Zhu et al. (2025) hand-crafted an equal-weight breadth-first frontier in a single residual stream for graph reachability, it remained open whether gradient descent could ever find this target amidst permutation-symmetric saddles. We close this gap on Reachability-by-Superposition over Erdős–Rényi graphs by isolating the architectural and supervisory contributions. Architecturally, we identify a Möbius attractor: under $S_n$-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. On the supervision side, we identify Cascade Supervision: a loss class whose backward pass simultaneously delivers (A) selectivity bootstrap, (B) gradient persistence across depth, and (C) per-step discrimination (e.g., $\mathcal{L}_{sup}$ and $\mathcal{L}_{node}$). End-to-end supervision fails condition (B) and is provably insufficient: internal gradients at layer $c$ decay as $(np)^{-(D-c-2)/2}$ in the graph fan-out and stall before the manifold is reached. Our thesis: Möbius attractor + Cascade Supervision = emergence of superposition reasoning. The parameter-free decay law predicts a final-step cosine of 0.35 vs. 0.71 (end-to-end vs. cascade) at depth $D=3$; experiments confirm 0.37 vs. 0.69, matching within 0.02 at every step.
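To make the scaling in the decay law concrete, here is a small sketch that evaluates the factor $(np)^{-(D-c-2)/2}$ per layer. The values of n and p are hypothetical and chosen only for illustration. The sketch reproduces the stated exponent and nothing more: it does not derive the predicted cosines of 0.35 and 0.71, which depend on details not given in the abstract.

```python
def decay_factor(n, p, depth, layer):
    """Gradient scaling (n*p)^(-(D - c - 2)/2) at layer c for depth D.

    n * p is the expected fan-out of an Erdős–Rényi graph with n nodes
    and edge probability p.
    """
    fan_out = n * p
    return fan_out ** (-(depth - layer - 2) / 2)

# Illustrative values only: n=16, p=0.25 gives an expected fan-out of 4.
n, p, D = 16, 0.25, 3
for c in range(D - 1):
    print(f"layer c={c}: factor={decay_factor(n, p, D, c):.3f}")
```

With a fan-out above 1, the factor shrinks as the layer index c moves further from the output, which is the sense in which end-to-end supervision fails gradient persistence across depth.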

Problem

Research questions and friction points this paper is trying to address.

superposition

graph reachability

gradient descent

permutation symmetry

reasoning frontier

Innovation

Methods, ideas, or system contributions that make the work stand out.

Möbius attractor

Cascade Supervision

superposition reasoning