🤖 AI Summary
Backpropagation is sequential across depth: the backward pass forms an O(K)-deep dependency chain that limits parallel training. This work proposes Latent Bounded Interfaces (LBI), which restrict inter-region communication to a latent interface of dimension r, much smaller than the hidden dimension d, making scan-based backpropagation tractable while preserving exact gradients under the bounded-interface model. The backward adjoint recursion becomes a suffix scan over fixed-size r×r Jacobians, which cuts the cost of each Jacobian composition from O(d³) to O(r³). Across four architectures (Mamba-2, Mamba-3, Transformer, and a Mamba-Transformer hybrid), an interface of dimension r=16 keeps training quality within 0.16–0.35 cross entropy of dense baselines, and cross-device backward communication is approximately 56 KB in the reported configurations, providing an algorithmic foundation for region-parallel training.
📝 Abstract
Backpropagation is inherently sequential across depth, creating an $O(K)$-deep dependency chain that bottlenecks parallel training. While parallel-scan formulations theoretically reduce this depth to $O(\log K)$, they are computationally prohibitive for modern architectures due to the $O(d^3)$ cost of composing full-rank $d\times d$ Jacobians over the entire hidden state. We introduce Latent Bounded Interfaces (LBI), an algorithmic formulation that makes scan-based backpropagation tractable by restricting inter-region communication to a low-dimensional latent interface, $m_k \in \mathbb{R}^{r}$, where $r \ll d$. This reduces the adjoint recursion to a suffix scan over $r \times r$ Jacobians, cutting per-combine cost from $O(d^3)$ to $O(r^3)$ while preserving exact gradients under the bounded-interface model. We demonstrate that LBI maintains model quality across four architectures (Mamba-2, Mamba-3, Transformer, and a Mamba--Transformer hybrid) at 47--61M block parameters. Interfaces of dimension $r=16$ suffice to preserve training quality within 0.16--0.35 cross entropy of dense baselines. The resulting framework provides an algorithmic foundation for region-parallel training, reducing cross-device backward communication to a single scan over $K$ fixed-size matrices (approximately 56 KB for our experimental configurations).
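To make the suffix-scan idea concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes a simplified adjoint recursion $a_k = J_k^\top a_{k+1}$ over $r \times r$ interface Jacobians $J_k$, with no local gradient terms. The function names and the Hillis-Steele scan schedule are illustrative choices, not taken from the paper. The scan builds the suffix products $J_k^\top J_{k+1}^\top \cdots J_K^\top$ in $\lceil \log_2 K \rceil$ rounds of independent $O(r^3)$ combines, and it can be checked against the $O(K)$-deep sequential recursion.

```python
import random


def matmul(A, B):
    """Dense matrix product; O(r^3) for r x r inputs."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matvec(A, v):
    """Matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def sequential_adjoints(jacobians, final_adjoint):
    """Reference O(K)-deep chain: a_k = J_k^T a_{k+1}, walking backward."""
    adjoints = [None] * len(jacobians)
    a = final_adjoint
    for k in reversed(range(len(jacobians))):
        a = matvec(transpose(jacobians[k]), a)
        adjoints[k] = a
    return adjoints


def suffix_scan_adjoints(jacobians, final_adjoint):
    """Suffix scan: S_k = J_k^T J_{k+1}^T ... J_K^T in ceil(log2 K) rounds.

    All combines within a round are independent of each other, so on real
    hardware a round could run in parallel (here it is a plain loop).
    """
    S = [transpose(J) for J in jacobians]
    K = len(S)
    step = 1
    while step < K:
        # After this round, S[k] covers regions k .. min(k + 2*step - 1, K - 1).
        S = [matmul(S[k], S[k + step]) if k + step < K else S[k]
             for k in range(K)]
        step *= 2
    return [matvec(S_k, final_adjoint) for S_k in S]


def random_matrix(r, rng):
    """Hypothetical stand-in for an r x r interface Jacobian."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(r)] for _ in range(r)]
```

Because matrix multiplication is associative, the scan returns the same adjoints as the sequential loop up to floating-point rounding; only the order of combines changes. Note that this schedule performs $O(K \log K)$ combines in total. A work-efficient scan would need $O(K)$, at the cost of a more involved schedule.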