🤖 AI Summary
To address the inefficiency of linear recurrence computation and the poor utilization of the GPU memory hierarchy in long-sequence modeling, this paper proposes the Sliding Window Recurrence (SWR) framework. The method introduces three key ideas: (1) truncating recurrences to hardware-aligned windows that are naturally jagged, which limits costly inter-warp communication; (2) a hierarchical decomposition of linear recurrences that aligns the algorithm with the GPU memory hierarchy; and (3) a modular Phalanx layer that serves as a drop-in replacement for either windowed attention or standard linear recurrence. By co-designing the algorithm with the GPU memory hierarchy, SWR achieves a 10–40% speedup over optimized Transformers on 1B-parameter multi-hybrid models across 4K–32K context lengths while matching perplexity, improving both the efficiency and practicality of long-context inference.
📝 Abstract
Multi-hybrid architectures are poised to take over language modeling due to their better quality and performance. We introduce a hierarchical decomposition framework for linear recurrences that lets us develop algorithms aligned with GPU memory hierarchies, yielding Sliding Window Recurrences (SWR). We focus specifically on truncating recurrences to hardware-aligned windows, which are naturally jagged, limiting costly inter-warp communication. Using SWR, we develop Phalanx layers that serve as drop-in replacements for windowed attention or linear recurrences. In 1B-parameter multi-hybrid models, Phalanx achieves a 10–40% speedup over optimized Transformers across 4K to 32K context lengths while matching perplexity.
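The paper's GPU kernels are not reproduced here, but the core idea of "truncating recurrences to hardware-aligned windows which are naturally jagged" can be illustrated with a minimal NumPy sketch. The recurrence form `h_t = a_t * h_{t-1} + b_t`, the block size, and all function names below are our assumptions for illustration, not the paper's actual parameterization: resetting the state at every block boundary means each position only attends to inputs since the start of its block, so the effective window length grows from 1 to `block` within each block rather than being fixed (hence "naturally jagged"), and no state ever crosses a block (warp-sized) boundary.

```python
import numpy as np

def full_linear_recurrence(a, b):
    # Standard linear recurrence: h_t = a_t * h_{t-1} + b_t, with h_{-1} = 0.
    h = np.zeros_like(b, dtype=float)
    state = 0.0
    for t in range(len(b)):
        state = a[t] * state + b[t]
        h[t] = state
    return h

def sliding_window_recurrence(a, b, block=4):
    # Truncated variant (illustrative): the state resets at every block
    # boundary, so position t depends only on inputs from the start of
    # its block. The effective window is "naturally jagged": within a
    # block it grows from 1 up to `block` instead of being fixed-length.
    # On a GPU, each block could be processed by one warp with no
    # inter-warp state exchange.
    h = np.zeros_like(b, dtype=float)
    for start in range(0, len(b), block):
        state = 0.0
        for t in range(start, min(start + block, len(b))):
            state = a[t] * state + b[t]
            h[t] = state
    return h
```

Within a block the truncated outputs match the full recurrence exactly; they diverge only at positions whose full-recurrence state would depend on inputs from an earlier block. For decaying gates (|a_t| < 1) that discarded contribution shrinks geometrically with distance, which is the intuition for why truncation can match quality.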