Approaching I/O-optimality for Approximate Attention

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the I/O cost of computing attention in large language models, which for existing methods such as FlashAttention grows quadratically with the sequence length $n$. Building on the approximate attention framework of Alman and Song, the authors give I/O-efficient algorithms whose cost depends almost linearly on $n$ in most parameter regimes, compared with the trivial lower bound of $\Omega(nd)$ transfers needed to read the inputs and write the output. They also prove corresponding lower bounds in each parameter regime, showing that the algorithms are close to I/O-optimal.

📝 Abstract

We revisit the I/O complexity of attention in large language models. Given query-key-value matrices $Q,K,V\in\mathbb{R}^{n\times d}$, and a machine with fast memory size $M$, the goal is to compute the "attention matrix" $A=\text{softmax}(QK^{\top}/\sqrt{d})V$ with the minimal number of data transfers between fast and slow memory. Existing methods in the literature, most notably FlashAttention and its variants, incur an I/O cost that depends quadratically on $n$, while a trivial lower bound only requires $\Omega(nd)$ I/O's to read the inputs and write the output. In this work, we present a technique for computing attention where the I/O cost only depends almost-linearly on $n$ in most parameter regimes. This is achieved by developing I/O-efficient algorithms inspired by the recent approximate attention framework of Alman and Song. We also prove corresponding lower bounds in each parameter regime to show that our algorithms are indeed close to I/O-optimal.
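To make the quadratic baseline concrete, the following is a minimal sketch of exact blocked attention with an online softmax, in the spirit of FlashAttention. It is not the algorithm proposed in the paper. The function names, the `block` parameter, and the `io` counter are illustrative assumptions: `io` simply counts matrix entries moved between slow and fast memory under a simplified model in which one query block plus one key/value block fit in fast memory at a time.

```python
import math

def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V, computed one row at a time."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)                      # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z
                    for c in range(d)])
    return out

def tiled_attention(Q, K, V, block):
    """Exact attention computed block by block with an online softmax.

    Returns (output, io), where io counts entries transferred in the
    simplified memory model described above (an assumption of this sketch).
    """
    n, d = len(Q), len(Q[0])
    io = 0
    out = []
    for i0 in range(0, n, block):
        q_blk = Q[i0:i0 + block]
        io += len(q_blk) * d                 # load one query block
        m = [-math.inf] * len(q_blk)         # running row maxima
        z = [0.0] * len(q_blk)               # running softmax normalizers
        acc = [[0.0] * d for _ in q_blk]     # running weighted sums of V
        for j0 in range(0, n, block):
            k_blk, v_blk = K[j0:j0 + block], V[j0:j0 + block]
            io += 2 * len(k_blk) * d         # load one key and value block
            for r, q in enumerate(q_blk):
                for k, v in zip(k_blk, v_blk):
                    s = sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                    m_new = max(m[r], s)
                    scale = math.exp(m[r] - m_new)   # rescale old state
                    w = math.exp(s - m_new)
                    z[r] = z[r] * scale + w
                    acc[r] = [a * scale + w * vc for a, vc in zip(acc[r], v)]
                    m[r] = m_new
        out.extend([[a / z[r] for a in acc[r]] for r in range(len(q_blk))])
        io += len(q_blk) * d                 # write the output block
    return out, io
```

In this model, when `block` divides $n$, the counter equals $2nd + 2n^2d/\text{block}$, so every query block re-reads all of $K$ and $V$. Halving `block` roughly doubles the key/value traffic, which is the quadratic dependence on $n$ that the abstract contrasts with the $\Omega(nd)$ cost of reading the inputs and writing the output.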

Problem

Research questions and friction points this paper is trying to address.

I/O complexity

attention mechanism

large language models

memory hierarchy

approximate attention

Innovation

Methods, ideas, or system contributions that make the work stand out.

I/O complexity

approximate attention

large language models