🤖 AI Summary
Standard Transformers struggle with high-resolution inputs due to the quadratic complexity of self-attention, and existing patching or downsampling strategies often compromise fine-grained details. To address this, we propose the Multipole Attention Neural Operator (MANO), which models attention as distance-based, multiscale interactions among grid points, introducing the fast multipole method from *n*-body simulations into Transformer architectures for the first time. MANO preserves a global receptive field in each attention head while achieving linear time and memory complexity, eliminating the need for downsampling and retaining the finest structural details. By integrating structural priors from both vision and physical simulation, it balances local precision with global modeling. On image classification and Darcy flow simulation benchmarks, MANO matches the accuracy of ViT and Swin Transformer while reducing runtime and peak memory consumption by one to two orders of magnitude.
📝 Abstract
Transformers have become the de facto standard for a wide range of tasks, from image classification to physics simulations. Despite their impressive performance, the quadratic complexity of standard Transformers in both memory and time with respect to the input length makes them impractical for processing high-resolution inputs. Therefore, several variants have been proposed, the most successful relying on patchification, downsampling, or coarsening techniques, often at the cost of losing the finest-scale details. In this work, we take a different approach. Inspired by state-of-the-art techniques in $n$-body numerical simulations, we cast attention as an interaction problem between grid points. We introduce the Multipole Attention Neural Operator (MANO), which computes attention in a distance-based multiscale fashion. MANO maintains, in each attention head, a global receptive field and achieves linear time and memory complexity with respect to the number of grid points. Empirical results on image classification and Darcy flows demonstrate that MANO rivals state-of-the-art models such as ViT and Swin Transformer, while reducing runtime and peak memory usage by orders of magnitude. We open source our code for reproducibility at https://github.com/AlexColagrande/MANO.
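The core idea, splitting attention into exact near-field interactions and coarse far-field interactions in the style of the fast multipole method, can be illustrated with a minimal two-level sketch. This is a hypothetical toy version, not the paper's implementation: function names and parameters (`window`, `cluster`) are illustrative, and the full method uses a hierarchy of scales rather than the single coarse level shown here.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multipole_attention(q, k, v, window=4, cluster=8):
    """Two-level multipole-style attention sketch (illustrative only).

    Each query attends exactly to keys within a local window (near field)
    and to cluster-averaged keys/values for everything farther away
    (far field), instead of attending to all n keys individually.
    """
    n, d = q.shape
    # Coarse level: average keys/values over contiguous clusters.
    # (Trailing tokens beyond m * cluster are near-field-only in this sketch.)
    m = n // cluster
    k_c = k[: m * cluster].reshape(m, cluster, d).mean(axis=1)
    v_c = v[: m * cluster].reshape(m, cluster, d).mean(axis=1)
    out = np.zeros_like(q)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Near field: exact fine-grained keys/values around token i.
        k_near, v_near = k[lo:hi], v[lo:hi]
        # Far field: coarse summaries of clusters fully outside the window.
        far = [j for j in range(m)
               if (j + 1) * cluster <= lo or j * cluster >= hi]
        k_all = np.concatenate([k_near, k_c[far]]) if far else k_near
        v_all = np.concatenate([v_near, v_c[far]]) if far else v_near
        w = softmax(k_all @ q[i] / np.sqrt(d))
        out[i] = w @ v_all
    return out
```

With a fixed window size and a hierarchy of progressively coarser levels (as in the actual fast multipole method), each query interacts with only O(1) summaries per scale, which is what yields the linear overall complexity claimed in the abstract.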