🤖 AI Summary
This work addresses two limitations: the quadratic computational complexity of softmax attention in standard Transformers, which hinders scaling to high-resolution vision tasks, and the tendency of existing linear attention methods, which lack theoretical grounding, to model mid-range token interactions poorly. To overcome these limitations, the authors propose LaplacianFormer, which introduces the Laplacian kernel into linear attention for the first time. It employs a provably injective feature mapping to preserve fine-grained token information and combines a Nyström approximation with Newton–Schulz iteration so that the computation avoids explicit matrix inversion. Experiments on ImageNet show that LaplacianFormer achieves a strong trade-off between performance and efficiency, improves mid-range interaction modeling, and remains suitable for deployment on edge devices.
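To make the mid-range argument concrete, the following minimal sketch compares how a Laplacian and a Gaussian kernel weight tokens at increasing distance. The kernel forms (L1 distance for the Laplacian, squared L2 distance for the Gaussian), the shared bandwidth `sigma`, and the function names are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def laplacian_kernel(q, k, sigma=1.0):
    # exp(-||q_i - k_j||_1 / sigma); the L1 form and sigma are assumptions.
    dist = np.abs(q[:, None, :] - k[None, :, :]).sum(axis=-1)
    return np.exp(-dist / sigma)

def gaussian_kernel(q, k, sigma=1.0):
    # exp(-||q_i - k_j||_2^2 / (2 sigma^2)).
    dist2 = ((q[:, None, :] - k[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-dist2 / (2.0 * sigma ** 2))

# One query at the origin, keys at distances 0.5, 1, 2 and 3 along one axis.
q = np.zeros((1, 1))
k = np.array([[0.5], [1.0], [2.0], [3.0]])
lap = laplacian_kernel(q, k)[0]
gau = gaussian_kernel(q, k)[0]
print(np.round(lap, 4), np.round(gau, 4))
```

Under these assumptions the two kernels agree at distance 2·sigma; beyond that point the Gaussian weight decays quadratically in the exponent and falls below the Laplacian weight, which is the suppression effect the summary refers to.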
📝 Abstract
The quadratic complexity of softmax attention presents a major obstacle for scaling Transformers to high-resolution vision tasks. Existing linear attention variants often replace the softmax with Gaussian kernels to reduce complexity, but such approximations lack theoretical grounding and tend to oversuppress mid-range token interactions. We propose LaplacianFormer, a Transformer variant that employs a Laplacian kernel as a principled alternative to softmax, motivated by empirical observations and theoretical analysis. To address expressiveness degradation under low-rank approximations, we introduce a provably injective feature map that retains fine-grained token information. For efficient computation, we adopt a Nyström approximation of the kernel matrix and solve the resulting system using Newton–Schulz iteration, avoiding costly matrix inversion and SVD. We further develop custom CUDA implementations for both the kernel and solver, enabling high-throughput forward and backward passes suitable for edge deployment. Experiments on ImageNet show that LaplacianFormer achieves strong performance-efficiency trade-offs while improving attention expressiveness.
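The Nyström and Newton–Schulz steps can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' CUDA implementation: the L1 Laplacian kernel form, the bandwidth `sigma`, the strided landmark selection, the row normalization, and all function names are hypothetical, and the injective feature map is omitted.

```python
import numpy as np

def laplacian_kernel(a, b, sigma=1.0):
    # exp(-||a_i - b_j||_1 / sigma); the L1 form and sigma are assumptions.
    dist = np.abs(a[:, None, :] - b[None, :, :]).sum(axis=-1)
    return np.exp(-dist / sigma)

def newton_schulz_inverse(w, num_iters=20):
    # Iterate X <- X (2I - W X). The initial scaling W^T / (||W||_1 ||W||_inf)
    # guarantees convergence to W^{-1} for any nonsingular W.
    eye = np.eye(w.shape[0])
    x = w.T / (np.linalg.norm(w, 1) * np.linalg.norm(w, np.inf))
    for _ in range(num_iters):
        x = x @ (2.0 * eye - w @ x)
    return x

def nystrom_kernel_attention(q, k, v, landmarks, sigma=1.0, num_iters=20):
    # Approximate K(q, k) by K(q, L) W^{-1} K(L, k), where W = K(L, L) and
    # L holds m landmark tokens. The n x n kernel matrix is never formed.
    k_ql = laplacian_kernel(q, landmarks, sigma)           # (n, m)
    k_lk = laplacian_kernel(landmarks, k, sigma)           # (m, n)
    w = laplacian_kernel(landmarks, landmarks, sigma)      # (m, m)
    w_inv = newton_schulz_inverse(w, num_iters)
    numer = k_ql @ (w_inv @ (k_lk @ v))                    # (n, d_v)
    denom = k_ql @ (w_inv @ k_lk.sum(axis=1, keepdims=True))  # (n, 1)
    return numer / denom  # row-normalized output (normalization is an assumption)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(16, 4)), rng.normal(size=(16, 4))
v = rng.normal(size=(16, 3))
out = nystrom_kernel_attention(q, k, v, landmarks=k[::4])  # 4 strided landmarks
print(out.shape)
```

With m landmarks and n tokens, every product above costs O(n·m) or O(m³) rather than O(n²), and the iteration uses only matrix multiplications, which is what makes it a natural fit for a GPU kernel. When the landmarks are all of the keys, the approximation reduces to exact kernel attention, which gives a convenient correctness check.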