🤖 AI Summary
To address the inefficiency of linear recurrence computation and the poor utilization of the GPU memory hierarchy in long-sequence modeling, this paper proposes the Sliding Window Recurrence (SWR) framework. The method introduces three key ideas: (1) truncating recurrences to hardware-aligned windows that are naturally jagged, which limits costly inter-warp communication; (2) a hierarchical decomposition of linear recurrences that aligns the algorithm with the GPU memory hierarchy; and (3) a modular Phalanx layer that serves as a drop-in replacement for either windowed attention or standard linear recurrence. By co-designing the algorithm with the GPU memory hierarchy, SWR achieves a 10–40% speedup over optimized Transformers on 1B-parameter multi-hybrid models across 4K–32K context lengths while matching perplexity, improving both the efficiency and practicality of long-context inference.
📝 Abstract
Multi-hybrid architectures are poised to take over language modeling due to their better quality and performance. We introduce a hierarchical decomposition framework for linear recurrences that lets us develop algorithms aligned with GPU memory hierarchies, yielding Sliding Window Recurrences (SWR). We focus specifically on truncating recurrences to hardware-aligned windows, which are naturally jagged, limiting costly inter-warp communication. Using SWR, we develop Phalanx layers that serve as drop-in replacements for windowed attention or linear recurrences. In 1B-parameter multi-hybrid models, Phalanx achieves a 10–40% speedup over optimized Transformers across 4K to 32K context lengths while matching perplexity.
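The paper's GPU kernels are not reproduced here, but the core idea of "truncating recurrences to hardware-aligned windows which are naturally jagged" can be illustrated with a minimal NumPy sketch. The recurrence form `h_t = a_t * h_{t-1} + b_t`, the block size, and all function names below are our assumptions for illustration, not the paper's actual parameterization: resetting the state at every block boundary means each position only attends to inputs since the start of its block, so the effective window length grows from 1 to `block` within each block rather than being fixed (hence "naturally jagged"), and no state ever crosses a block (warp-sized) boundary.

```python
import numpy as np

def full_linear_recurrence(a, b):
    # Standard linear recurrence: h_t = a_t * h_{t-1} + b_t, with h_{-1} = 0.
    h = np.zeros_like(b, dtype=float)
    state = 0.0
    for t in range(len(b)):
        state = a[t] * state + b[t]
        h[t] = state
    return h

def sliding_window_recurrence(a, b, block=4):
    # Truncated variant (illustrative): the state resets at every block
    # boundary, so position t depends only on inputs from the start of
    # its block. The effective window is "naturally jagged": within a
    # block it grows from 1 up to `block` instead of being fixed-length.
    # On a GPU, each block could be processed by one warp with no
    # inter-warp state exchange.
    h = np.zeros_like(b, dtype=float)
    for start in range(0, len(b), block):
        state = 0.0
        for t in range(start, min(start + block, len(b))):
            state = a[t] * state + b[t]
            h[t] = state
    return h
```

Within a block the truncated outputs match the full recurrence exactly; they diverge only at positions whose full-recurrence state would depend on inputs from an earlier block. For decaying gates (|a_t| < 1) that discarded contribution shrinks geometrically with distance, which is the intuition for why truncation can match quality.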