Lightweight Backbone Networks Only Require Adaptive Lightweight Self-Attention Mechanisms

📅 2025-08-02

🤖 AI Summary
To address the imbalance in computational efficiency between CNNs and attention mechanisms in lightweight hybrid backbones, this paper proposes Fast Window Attention (FWA), an attention mechanism with adaptive feature-map sizing, and introduces LOLViT, a lightweight global-local fusion network built upon GhostNet. FWA reduces the computational overhead of long-sequence modeling via windowed token aggregation, adaptive feature-map compression, and a ReLU-based approximation of SoftMax; its key sequence generation is both compact and learnable, eliminating the hyperparameter dependency of fixed-dimensional projections. Evaluated on ImageNet-1K, COCO 2017, and BDD100K, LOLViT achieves accuracy comparable to or exceeding that of MobileViT-X while accelerating inference by up to 5×. The design thus delivers superior efficiency, strong generalization across vision tasks, and practical deployability on resource-constrained devices.
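The windowed token aggregation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, plain average pooling stands in for the learnable key generation the summary mentions, and the feature-map size and window are illustrative values only.

```python
import numpy as np

def window_aggregate(feature_map, window):
    """Average-pool an (H, W, C) map into (H//window * W//window, C) tokens.

    Assumption: average pooling stands in for the paper's learnable aggregation.
    """
    h, w, c = feature_map.shape
    assert h % window == 0 and w % window == 0, "sketch assumes divisible sizes"
    blocks = feature_map.reshape(h // window, window, w // window, window, c)
    return blocks.mean(axis=(1, 3)).reshape(-1, c)

def softmax(x, axis=-1):
    """Numerically stable SoftMax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def windowed_kv_attention(feature_map, window):
    """Attend from every pixel (query) to a compact set of pooled keys/values."""
    h, w, c = feature_map.shape
    queries = feature_map.reshape(h * w, c)                 # one query per pixel
    keys = values = window_aggregate(feature_map, window)   # M pooled tokens
    scores = queries @ keys.T / np.sqrt(c)                  # (H*W, M) instead of (H*W, H*W)
    return softmax(scores) @ values                         # (H*W, C)

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 8, 16))
out = windowed_kv_attention(fmap, window=4)
print(out.shape)  # (64, 16)
```

In this sketch the score matrix has 64 x 4 entries rather than 64 x 64, which is where the saving comes from.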

📝 Abstract
Lightweight hybrid backbone networks have partially alleviated computational saturation, but the imbalance in computational efficiency between convolutional neural networks (CNNs) and attention mechanisms is becoming increasingly apparent. Specifically, although linear attention mechanisms and their variants have made progress in lightweight design, they still fail to meet the demands of hybrid models for long-sequence modeling. Existing lightweight SoftMax attention, on the other hand, typically reduces the feature map to a fixed size to shorten the token sequence and thereby shrink the computation. However, determining the feature-map reduction ratio is cumbersome, and computational saturation persists. To address these issues, this paper proposes Fast Window Attention (FWA), a lightweight SoftMax attention mechanism with adaptive feature-map sizes that generates a small number of key sequences (Key and Value) through window aggregation for attention computation. The paper also explains why ReLU is a reasonable substitute for the SoftMax operation in lightweight global attention mechanisms. Finally, it designs a global-local feature fusion mechanism and combines it with GhostNet to propose a lightweight hybrid backbone network, LOLViT. Experiments on classification (ImageNet-1K), detection (COCO 2017), and segmentation (BDD100K), along with extensive ablation studies, show that LOLViT outperforms CNN models of the same level in both inference speed and accuracy. Notably, the inference speed of LOLViT-X is 5× that of MobileViT-X.
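The abstract does not give the exact form of the ReLU substitution, so the following is only a sketch under an assumption: attention scores are passed through ReLU and then divided by their row sum, so that the weights stay non-negative and roughly normalized without an exponential. The function name `relu_attention` and the `eps` parameter are hypothetical.

```python
import numpy as np

def relu_attention(queries, keys, values, eps=1e-6):
    """Attention with ReLU + row-sum normalization in place of SoftMax.

    Assumption: this normalization is illustrative, not the paper's formula.
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = np.maximum(scores, 0.0)                        # ReLU: drop negative scores
    weights = weights / (weights.sum(axis=-1, keepdims=True) + eps)
    return weights @ values, weights

rng = np.random.default_rng(0)
q = rng.standard_normal((64, 16))   # 64 query tokens
k = rng.standard_normal((4, 16))    # 4 aggregated key tokens
v = rng.standard_normal((4, 16))    # 4 aggregated value tokens
out, weights = relu_attention(q, k, v)
print(out.shape, weights.shape)  # (64, 16) (64, 4)
```

Unlike SoftMax, this variant assigns exactly zero weight to keys with negative scores, and a query whose scores are all negative receives an all-zero output.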
Problem

Research questions and friction points this paper is trying to address.

Address the efficiency imbalance between CNNs and attention mechanisms
Improve lightweight SoftMax attention for long-sequence modeling
Propose adaptive feature map size for computational efficiency
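The efficiency argument behind these points comes down to simple arithmetic on the attention matrix size. The sketch below counts multiply-accumulate operations for full self-attention versus attention against window-aggregated keys; the feature-map size, channel width, and window are illustrative values, not figures from the paper, and the helper names are hypothetical.

```python
def attention_macs(num_queries, num_keys, dim):
    """Multiply-accumulates for Q @ K^T plus weights @ V (projections ignored)."""
    return 2 * num_queries * num_keys * dim

def pooled_key_count(height, width, window):
    """Number of key tokens left after aggregating non-overlapping windows."""
    return (height // window) * (width // window)

h, w, dim, window = 32, 32, 64, 8       # illustrative sizes only
n = h * w                               # 1024 query tokens
m = pooled_key_count(h, w, window)      # 16 aggregated key tokens

full = attention_macs(n, n, dim)        # every token attends to every token
reduced = attention_macs(n, m, dim)     # every token attends to pooled keys
print(full // reduced)  # 64
```

The ratio equals n / m, so the cost drops from quadratic in the token count to linear in it for a fixed number of aggregated keys.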
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive lightweight SoftMax attention mechanism
ReLU simulating SoftMax in global attention
Lightweight hybrid backbone network LOLViT
Authors

Fengyun Li
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Chao Zheng
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yangyang Fang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Jialiang Lan
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Jianhua Liang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Luhao Zhang
Assistant Professor, Johns Hopkins University
Decision Science, Human-AI interaction
Fa Si
School of Computer Science and Engineering, Northeastern University, Shenyang, China