LBI: Parallel Scan Backpropagation via Latent Bounded Interfaces

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Backpropagation is sequential along the depth dimension, producing an O(K)-deep dependency chain that limits parallel training. This work proposes Latent Bounded Interfaces (LBI), which restrict inter-region communication to a low-dimensional latent interface of dimension r, much smaller than the hidden dimension d. The backward adjoint recursion then becomes a suffix scan over r × r Jacobians, which cuts the per-combine cost from O(d³) to O(r³) while keeping gradients exact under the bounded-interface model. Across four architectures (Mamba-2, Mamba-3, Transformer, and a Mamba-Transformer hybrid), an interface of dimension r = 16 stays within 0.16–0.35 cross-entropy of dense baselines, and cross-device backward communication reduces to a single scan over K fixed-size matrices, approximately 56 KB in the reported configurations. The authors present this as an algorithmic foundation for region-parallel training.

📝 Abstract

Backpropagation is inherently sequential across depth, creating an $O(K)$-deep dependency chain that bottlenecks parallel training. While parallel-scan formulations theoretically reduce this depth to $O(\log K)$, they are computationally prohibitive for modern architectures due to the $O(d^3)$ cost of composing full-rank $d \times d$ Jacobians over the entire hidden state. We introduce Latent Bounded Interfaces (LBI), an algorithmic formulation that makes scan-based backpropagation tractable by restricting inter-region communication to a low-dimensional latent interface, $m_k \in \mathbb{R}^{r}$, where $r \ll d$. This reduces the adjoint recursion to a suffix scan over $r \times r$ Jacobians, cutting per-combine cost from $O(d^3)$ to $O(r^3)$ while preserving exact gradients under the bounded-interface model. We demonstrate that LBI maintains model quality across four architectures (Mamba-2, Mamba-3, Transformer, and a Mamba-Transformer hybrid) at 47–61M block parameters. Interfaces of dimension $r=16$ suffice to preserve training quality within 0.16–0.35 cross-entropy of dense baselines. The resulting framework provides an algorithmic foundation for region-parallel training, reducing cross-device backward communication to a single scan over $K$ fixed-size matrices (approximately 56 KB for our experimental configurations).
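
To make the suffix-scan idea concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes a purely linear adjoint recursion $g_k = J_k^{\top} g_{k+1}$ over the interface, where $J_k = \partial m_{k+1} / \partial m_k$ is an $r \times r$ Jacobian, and it ignores any local gradient terms a real model would add. The function names (`suffix_scan_matmul`, `scan_adjoints`, `sequential_adjoints`) are hypothetical. The scan uses a doubling scheme, so it needs about $\log_2 K$ rounds, each made of independent $r \times r$ products costing $O(r^3)$.

```python
import numpy as np

def suffix_scan_matmul(mats):
    """Inclusive suffix scan: out[k] = mats[k] @ mats[k+1] @ ... @ mats[K-1].

    Doubling scheme: each round combines every entry with the entry
    `offset` positions to its right, so ceil(log2 K) rounds suffice.
    """
    out = mats.copy()
    K = out.shape[0]
    offset = 1
    while offset < K:
        nxt = out.copy()
        # Batched r x r products; all K - offset combines are independent.
        nxt[:K - offset] = out[:K - offset] @ out[offset:]
        out = nxt
        offset *= 2
    return out

def scan_adjoints(jacobians, g_final):
    """Adjoints g_0..g_K from suffix products of transposed Jacobians (assumed model)."""
    K, r, _ = jacobians.shape
    suffix = suffix_scan_matmul(jacobians.transpose(0, 2, 1))
    g = np.empty((K + 1, r))
    g[:K] = suffix @ g_final  # g_k = J_k^T ... J_{K-1}^T g_K
    g[K] = g_final
    return g

def sequential_adjoints(jacobians, g_final):
    """Reference O(K)-deep backward recursion g_k = J_k^T g_{k+1}."""
    K, r, _ = jacobians.shape
    g = np.empty((K + 1, r))
    g[K] = g_final
    for k in range(K - 1, -1, -1):
        g[k] = jacobians[k].T @ g[k + 1]
    return g

# Illustrative sizes only: K = 8 regions, interface dimension r = 16.
rng = np.random.default_rng(0)
K, r = 8, 16
jacobians = rng.standard_normal((K, r, r)) / np.sqrt(r)
g_final = rng.standard_normal(r)
assert np.allclose(scan_adjoints(jacobians, g_final),
                   sequential_adjoints(jacobians, g_final))
```

The final assertion checks that the scan reproduces the sequential recursion up to floating-point error, which is the sense in which gradients stay exact in this simplified setting. With full $d \times d$ Jacobians the same code would work, but each combine would cost $O(d^3)$ instead of $O(r^3)$.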

Problem

Research questions and friction points this paper is trying to address.

backpropagation

parallel training

computational bottleneck

Jacobian composition

sequential dependency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Bounded Interfaces

parallel backpropagation

scan-based training