Scaling Attention to Very Long Sequences in Linear Time with Wavelet-Enhanced Random Spectral Attention (WERSA)

📅 2025-07-11

📈 Citations: 0

✨ Influential: 0

career value

205K/year

🤖 AI Summary

Standard Transformer attention incurs O(n²) time complexity, severely hindering efficient modeling of long sequences. This paper proposes WERSA (Wavelet-Enhanced Random Spectral Attention), the first attention mechanism that integrates multi-resolution Haar wavelet analysis with content-adaptive random spectral features to enable dynamic scale selection and linear-complexity (O(n)) attention computation. Built upon a learnable multi-head architecture, WERSA supports memory-efficient training on a single GPU. On multiple long-sequence benchmark tasks, WERSA achieves state-of-the-art accuracy—79.1% on 128k-length sequences—while reducing training time by 81% and FLOPs by 73.4% compared to prior linear attention methods, demonstrating substantial improvements in both accuracy and efficiency.

Technology Category

Application Category

📝 Abstract

Transformer models are computationally costly on long sequences since regular attention has quadratic $O(n^2)$ time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism of linear $O(n)$ time complexity that is pivotal to enable successful long-sequence processing without the performance trade-off. WERSA merges content-adaptive random spectral features together with multi-resolution Haar wavelets and learnable parameters to selectively attend to informative scales of data while preserving linear efficiency. Large-scale comparisons extbf{on single GPU} and across various benchmarks (vision, NLP, hierarchical reasoning) and various attention mechanisms (like Multiheaded Attention, Flash-Attention-2, FNet, Linformer, Performer, Waveformer), reveal uniform advantages of WERSA. It achieves best accuracy in all tests. On ArXiv classification, WERSA improves accuracy over vanilla attention by 1.2% (86.2% vs 85.0%) while cutting training time by 81% (296s vs 1554s) and FLOPS by 73.4% (26.2G vs 98.4G). Significantly, WERSA excels where vanilla and FlashAttention-2 fail: on ArXiv-128k's extremely lengthy sequences, it achieves best accuracy (79.1%) and AUC (0.979) among viable methods, operating on data that gives Out-Of-Memory errors to quadratic methods while being extbf{twice as fast} as Waveformer, its next-best competitor. By significantly reducing computational loads without compromising accuracy, WERSA makes possible more practical, more affordable, long-context models, in particular on low-resource hardware, for more sustainable and more scalable AI development.

Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic time complexity in Transformer attention mechanisms

Enables efficient processing of very long sequences with linear complexity

Improves accuracy while significantly cutting computational costs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear time complexity with WERSA

Combines random spectral features and wavelets

Enables efficient long-sequence processing

🔎 Similar Papers

Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences

2023-10-18arXiv.orgCitations: 9

ByteDance

United States / China / Singapore

Research Engineer, Monetization AI