🤖 AI Summary
Standard Transformer attention incurs O(n²) time complexity, severely hindering efficient modeling of long sequences. This paper proposes WERSA (Wavelet-Enhanced Random Spectral Attention), the first attention mechanism that integrates multi-resolution Haar wavelet analysis with content-adaptive random spectral features to enable dynamic scale selection and linear-complexity (O(n)) attention computation. Built upon a learnable multi-head architecture, WERSA supports memory-efficient training on a single GPU. On multiple long-sequence benchmark tasks, WERSA achieves state-of-the-art accuracy (79.1% on 128k-length sequences) while reducing training time by 81% and FLOPs by 73.4% compared with vanilla attention, demonstrating substantial improvements in both accuracy and efficiency.
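The paper's exact formulation isn't reproduced here, but the Haar wavelet analysis the summary refers to can be illustrated with a minimal multi-level decomposition. This sketch (plain NumPy, standard orthonormal Haar filters, not the paper's code) shows how a signal splits into detail coefficients at progressively coarser scales plus a final approximation, which is what gives an attention mechanism access to multiple resolutions of its input:

```python
import numpy as np

def haar_decompose(x, levels):
    """Multi-level Haar wavelet decomposition of a 1-D signal.

    Returns a list of detail coefficients (finest to coarsest) and the
    final approximation. Assumes len(x) is divisible by 2**levels.
    """
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))  # high-pass: local detail
        approx = (even + odd) / np.sqrt(2)         # low-pass: coarser view
    return details, approx

def haar_reconstruct(details, approx):
    """Exact inverse of haar_decompose (Haar filters are orthonormal)."""
    x = approx
    for d in reversed(details):
        even = (x + d) / np.sqrt(2)
        odd = (x - d) / np.sqrt(2)
        out = np.empty(even.size + odd.size)
        out[0::2], out[1::2] = even, odd
        x = out
    return x
```

Each decomposition level halves the sequence length, so the full transform costs O(n), consistent with the linear complexity the summary highlights.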
📝 Abstract
Transformer models are computationally costly on long sequences because standard attention has quadratic $O(n^2)$ time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism with linear $O(n)$ time complexity that enables long-sequence processing without sacrificing performance. WERSA combines content-adaptive random spectral features with multi-resolution Haar wavelets and learnable parameters to selectively attend to informative scales of the data while preserving linear efficiency.
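The random spectral features mentioned above are not specified in detail in the abstract; the sketch below uses the well-known Performer-style positive random features as a stand-in to show how such features yield linear-time attention. The feature count `num_features` and the projection matrix `W` are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def random_feature_attention(Q, K, V, num_features=64, seed=0):
    """Linear-time attention via random spectral features.

    Approximates softmax attention with positive random features
    (Performer-style), costing O(n * m * d) for m features instead of
    the O(n^2 * d) of exact attention.
    """
    n, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, num_features))  # random spectral projection

    def phi(X):
        # Positive features approximating exp(q . k / sqrt(d)).
        Xs = X / d ** 0.25
        return np.exp(Xs @ W - np.sum(Xs**2, axis=-1, keepdims=True) / 2) \
            / np.sqrt(num_features)

    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                # aggregate keys/values once: O(n m d)
    Z = Qf @ Kf.sum(axis=0)      # per-query normalizer
    return (Qf @ KV) / Z[:, None]
```

Because keys and values are aggregated once before being combined with the queries, the $n \times n$ attention matrix is never materialized, which is what makes 128k-length sequences feasible in memory.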
Large-scale comparisons **on a single GPU**, across various benchmarks (vision, NLP, hierarchical reasoning) and attention mechanisms (Multiheaded Attention, Flash-Attention-2, FNet, Linformer, Performer, Waveformer), reveal uniform advantages for WERSA, which achieves the best accuracy in all tests. On ArXiv classification, WERSA improves accuracy over vanilla attention by 1.2% (86.2% vs 85.0%) while cutting training time by 81% (296s vs 1554s) and FLOPs by 73.4% (26.2G vs 98.4G). Significantly, WERSA excels where vanilla attention and FlashAttention-2 fail: on ArXiv-128k's extremely long sequences, it achieves the best accuracy (79.1%) and AUC (0.979) among viable methods, operating on data that triggers out-of-memory errors for quadratic methods, while being **twice as fast** as Waveformer, its next-best competitor.
By significantly reducing computational load without compromising accuracy, WERSA makes long-context models more practical and affordable, particularly on low-resource hardware, supporting more sustainable and scalable AI development.