MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
The quadratic computational complexity of Transformer attention severely hampers inference efficiency on long sequences. To address this, MonarchAttention approximates softmax attention using Monarch matrices, allowing every attention layer of a pretrained model to be replaced with no additional training. The method combines a variational formulation of softmax, an efficient projection algorithm, and optimized kernels that exploit GPU tensor cores. It incurs minimal accuracy loss while substantially improving hardware efficiency: across sequence lengths from 256 to 16K, MonarchAttention achieves 1.4×–8.2× wall-clock speedups over FlashAttention-2. It applies to both vision and language models without architectural modification, offering a structured, readily deployable attention approximation for long-context large language model inference.

📝 Abstract
Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N} d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at https://github.com/cjyaras/monarch-attention.
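The $\Theta(N\sqrt{N} d)$ complexity in the abstract stems from the Monarch structure itself: in one common parameterization, a Monarch matrix factors as $M = P L P^\top R$, where $L$ and $R$ are block-diagonal with $\sqrt{N}$ blocks of size $\sqrt{N} \times \sqrt{N}$ and $P$ is a fixed stride permutation. The following NumPy sketch (an illustration under these assumptions, not the paper's optimized CUDA kernel; all names are our own) shows why a Monarch matrix–vector product costs $\Theta(N\sqrt{N})$ rather than $\Theta(N^2)$:

```python
import numpy as np

def monarch_matvec(L, R, x):
    """y = M @ x for M = P @ blockdiag(L) @ P.T @ blockdiag(R).

    L, R: (m, m, m) arrays, i.e. m diagonal blocks of shape (m, m).
    x: vector of length N = m * m.
    Cost: two batched block multiplies, 2 * m^3 = 2 * N * sqrt(N) mults.
    """
    m = L.shape[0]
    X = x.reshape(m, m)                # split x into m contiguous blocks
    X = np.einsum('bij,bj->bi', R, X)  # apply block-diagonal R
    X = X.T                            # stride permutation P.T
    X = np.einsum('bij,bj->bi', L, X)  # apply block-diagonal L
    X = X.T                            # stride permutation P
    return X.reshape(-1)

# Sanity check against an explicitly materialized dense Monarch matrix.
rng = np.random.default_rng(0)
m = 4
N = m * m
L = rng.standard_normal((m, m, m))
R = rng.standard_normal((m, m, m))
x = rng.standard_normal(N)

def blockdiag(B):
    D = np.zeros((m * m, m * m))
    for i in range(m):
        D[i * m:(i + 1) * m, i * m:(i + 1) * m] = B[i]
    return D

P = np.zeros((N, N))
for i in range(m):
    for j in range(m):
        P[j * m + i, i * m + j] = 1.0  # transpose of the (m, m) reshape

M_dense = P @ blockdiag(L) @ P.T @ blockdiag(R)
print(np.allclose(monarch_matvec(L, R, x), M_dense @ x))  # True
```

The two `einsum` calls are batched small matrix multiplies, which is what makes the structure map onto GPU tensor cores; the paper's contribution is projecting softmax attention onto this class, not the matvec itself.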
Problem

Research questions and friction points this paper is trying to address.

Reduces quadratic complexity of Transformer attention mechanisms
Enables efficient hardware-aware attention approximation via Monarch matrices
Achieves significant speed-ups with minimal performance loss and no retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monarch matrices enable sub-quadratic attention approximation
Optimization-based algorithm reduces computational complexity
Hardware-efficient design maximizes GPU throughput
Can Yaras
PhD Student, University of Michigan
Deep Learning, Optimization
Alec S. Xu
Department of Electrical Engineering & Computer Science, University of Michigan
Pierre Abillama
Department of Electrical Engineering & Computer Science, University of Michigan
Changwoo Lee
Department of Electrical Engineering & Computer Science, University of Michigan
Laura Balzano
University of Michigan, Ann Arbor
matrix factorization, matrix completion, manifold optimization, nonconvex optimization