SEMA: a Scalable and Efficient Mamba like Attention via Token Localization and Averaging

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional vision attention mechanisms suffer from two fundamental limitations: the quadratic computational complexity of full attention and the weak focusing ability of its linear variants. This paper proposes SEMA, an attention framework that first formally defines generalized attention and proves its inherent dispersion property: as the number of keys grows, attention weights flatten toward uniform. Leveraging this insight, SEMA introduces a dual-path "localization + averaging" design: token localization suppresses weight dispersion and preserves focus, while a theoretically consistent arithmetic-mean aggregation captures the global aspect of attention, all at linear complexity. The design draws on the recent Mamba line of attention models to improve long-range modeling. On ImageNet-1K, SEMA significantly outperforms vision Mamba models of comparable parameter count; notably, its performance gain increases at larger image resolutions, demonstrating superior scalability.
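To make the dual-path idea concrete, here is a minimal PyTorch sketch, assuming a sliding-window softmax for the localization path and a plain arithmetic mean of the value tokens for the averaging path. The function name `sema_like_attention`, the `window` parameter, and the zero-padding at the sequence boundaries are illustrative assumptions; this is a sketch of the idea, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sema_like_attention(q, k, v, window: int = 16):
    """Illustrative 'localization + averaging' attention (hypothetical,
    not the paper's code).

    q, k, v: (batch, n_tokens, dim). Local path: each query applies
    softmax over a window of neighboring keys, so the weights stay
    focused and the cost is linear in n_tokens. Global path: the
    arithmetic mean of all value tokens restores global context.
    """
    assert window % 2 == 0, "even window keeps the padding symmetric"
    d = q.shape[-1]
    pad = window // 2
    # Zero-pad the token dimension (a simplification at the boundaries).
    k_pad = F.pad(k, (0, 0, pad, pad))          # (b, n + window, d)
    v_pad = F.pad(v, (0, 0, pad, pad))
    # unfold gathers each token's neighborhood: (b, n, d, window + 1).
    k_loc = k_pad.unfold(1, window + 1, 1)
    v_loc = v_pad.unfold(1, window + 1, 1)
    # Localization path: softmax restricted to the local window.
    scores = torch.einsum("bnd,bndw->bnw", q, k_loc) / d ** 0.5
    local = torch.einsum("bnw,bndw->bnd", scores.softmax(dim=-1), v_loc)
    # Averaging path: arithmetic mean of all value tokens.
    return local + v.mean(dim=1, keepdim=True)

x = torch.randn(2, 196, 64)            # e.g. 14 x 14 ViT tokens
out = sema_like_attention(x, x, x)     # (2, 196, 64)
```

The local path costs O(n · window · d) and the mean costs O(n · d), so the whole layer stays linear in the token count, which is the property the summary highlights.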

📝 Abstract
Attention is the critical component of a transformer. Yet the quadratic computational complexity of vanilla full attention in the input size, and the inability of its linear attention variant to focus, have been challenges for computer vision tasks. We provide a mathematical definition of generalized attention and formulate both vanilla softmax attention and linear attention within this general framework. We prove that generalized attention disperses; that is, as the number of keys tends to infinity, the query assigns equal weights to all keys. Motivated by the dispersion property and the recent development of Mamba forms of attention, we design Scalable and Efficient Mamba-like Attention (SEMA), which utilizes token localization to avoid dispersion and maintain focusing, complemented by theoretically consistent arithmetic averaging to capture the global aspect of attention. We support our approach on ImageNet-1K, where classification results show that SEMA is a scalable and effective alternative beyond linear attention, outperforming recent vision Mamba models at increasingly larger image scales and similar model parameter sizes.
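The dispersion theorem admits a short sanity-check argument. The LaTeX sketch below is a hedged reconstruction; the paper's precise definition of generalized attention may differ, but any similarity-normalized scheme of this shape disperses whenever the similarity s is bounded above and below (as it is for bounded queries and keys):

```latex
% Generalized attention: a query q aggregates values v_j with weights
% produced by a nonnegative similarity s(q, k_j) (reconstructed form).
\[
\mathrm{Attn}\bigl(q;\{k_j,v_j\}_{j=1}^{N}\bigr) = \sum_{j=1}^{N} w_j\, v_j,
\qquad
w_j = \frac{s(q,k_j)}{\sum_{m=1}^{N} s(q,k_m)} .
\]
% Softmax attention: s(q,k) = \exp(q^{\top}k/\sqrt{d});
% linear attention:  s(q,k) = \phi(q)^{\top}\phi(k) for a feature map \phi.
\[
0 < c \le s(q,k_j) \le C \;\;\forall j
\;\Longrightarrow\;
\frac{c}{C\,N} \le w_j \le \frac{C}{c\,N}
\;\xrightarrow[N\to\infty]{}\; 0,
\]
% so every weight flattens toward the uniform value 1/N: the query
% cannot single out any key once there are enough of them.
```

Both softmax and linear attention instantiate this form, so under the boundedness assumption both flatten as N grows, which is the sense in which generalized attention "disperses."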
Problem

Research questions and friction points this paper is trying to address.

Addresses the quadratic complexity of vanilla full attention
Tackles linear attention's inability to focus effectively (both regimes are contrasted in the sketch after this list)
Proposes SEMA for scalable, efficient attention in vision tasks
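As a point of reference for these bullets, below is a minimal non-causal sketch of the two complexity regimes. The `elu(x) + 1` feature map is a common choice from the linear-attention literature and is an assumption here, not necessarily the map the paper analyzes.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Vanilla attention materializes an (n x n) score matrix: O(n^2 d).
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention with phi(x) = elu(x) + 1 (a common choice).
    # Reassociating (phi(q) @ phi(k)^T) @ v as phi(q) @ (phi(k)^T @ v)
    # avoids the n x n matrix: cost drops to O(n d^2), linear in n.
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v                          # (b, d, d)
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)
    return (phi_q @ kv) / (z + eps)                           # (b, n, d)
```

The speedup comes at the cost the abstract names: the normalized weights phi(q)·phi(k_j) / Σ_m phi(q)·phi(k_m) are exactly the bounded-similarity case of the dispersion bound above, so they flatten as n grows and the query loses its ability to focus.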
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token localization prevents attention dispersion
Arithmetic averaging captures global attention aspects
Scalable Mamba-like attention for vision tasks
👥 Authors
Nhat Thanh Tran
Department of Mathematics, University of California, Irvine
Fanghui Xue
Qualcomm AI Research
Shuai Zhang
Qualcomm AI Research
Jiancheng Lyu
Qualcomm AI Research
PDEs, Optimization, Deep Learning
Yunling Zheng
Qualcomm AI Research
Yingyong Qi
Qualcomm AI Research
Jack Xin
Distinguished Professor of Mathematics, UC Irvine
Applied Computational Math, Machine Learning and Applications