Lightweight Backbone Networks Only Require Adaptive Lightweight Self-Attention Mechanisms

📅 2025-08-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the efficiency imbalance between CNN and attention components in lightweight hybrid backbones, this paper proposes Fast Window Attention (FWA), an attention mechanism with adaptive feature-map sizing, and introduces LOLViT, a lightweight global-local fusion network built on GhostNet. FWA lowers the cost of long-sequence modeling through windowed token aggregation, adaptive feature-map compression, and a ReLU-based approximation of SoftMax; its key sequences are compact and learnable, which removes the hyperparameter tuning that fixed-size projections require. Evaluated on ImageNet-1K, COCO2017, and BDD100K, LOLViT achieves accuracy comparable to or exceeding MobileViT-X while running inference up to 5× faster. The design is therefore efficient, generalizes across vision tasks, and is practical to deploy on resource-constrained devices.

📝 Abstract

Lightweight hybrid backbone networks have partially alleviated the problem of computational saturation, but the gap in computational efficiency between convolutional neural networks (CNNs) and attention mechanisms is becoming increasingly apparent. Although linear attention mechanisms and their variants have made progress in lightweight design, they still fail to meet the demands of hybrid models for long-sequence modeling. Existing lightweight SoftMax attention, on the other hand, typically reduces the feature map to a fixed size to shorten the token sequence and so shrink the computation. However, determining the feature-map reduction ratio is cumbersome, and computational saturation still persists. To address these problems, this paper proposes a lightweight SoftMax attention mechanism with adaptive feature-map sizes, named Fast Window Attention (FWA), which generates a small number of key sequences (Key and Value) through window aggregation for attention computation. The paper also explains why it is reasonable to use ReLU to simulate the SoftMax operation in lightweight global attention mechanisms. Finally, it designs a global-local feature fusion mechanism and combines it with GhostNet to form a lightweight hybrid backbone network, LOLViT. Experiments on vision tasks including classification (ImageNet 1K), detection (COCO 2017), and segmentation (BDD100K), together with extensive ablation studies, show that LOLViT outperforms CNN models of the same level in both inference speed and accuracy. Notably, the inference speed of LOLViT-X is 5x that of MobileViT-X.
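
The abstract describes two ideas that a short sketch can make concrete: Key and Value are built from a few window-aggregated tokens, and ReLU stands in for the exponential in SoftMax. The NumPy code below is only an illustration under stated assumptions, not the paper's FWA implementation: average pooling as the window aggregation, sum normalization of the ReLU scores, the `grid` parameter, and the function names `window_pool` and `relu_window_attention` are all hypothetical choices made for this example.

```python
import numpy as np

def window_pool(x, grid):
    """Average-pool an (H, W, C) feature map down to (grid, grid, C).

    The window size is derived from H and W, so the number of pooled
    tokens stays grid * grid whatever the input resolution is.
    Assumes H and W are divisible by grid (an assumption of this sketch).
    """
    h, w, c = x.shape
    wh, ww = h // grid, w // grid
    return x.reshape(grid, wh, grid, ww, c).mean(axis=(1, 3))

def relu_window_attention(x, wq, wk, wv, grid=4):
    """Attention with N = H*W queries and M = grid*grid keys/values."""
    h, w, c = x.shape
    tokens = x.reshape(h * w, c)                            # (N, C)
    pooled = window_pool(x, grid).reshape(grid * grid, c)   # (M, C)

    q = tokens @ wq                                         # (N, D)
    k = pooled @ wk                                         # (M, D)
    v = pooled @ wv                                         # (M, D)

    # ReLU replaces exp(): negative scores get zero weight.
    scores = np.maximum(q @ k.T / np.sqrt(q.shape[-1]), 0.0)  # (N, M)
    # Sum normalization is an assumption here; eps avoids division by zero.
    weights = scores / (scores.sum(axis=-1, keepdims=True) + 1e-6)
    out = weights @ v                                       # (N, D)
    return out.reshape(h, w, -1), weights

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
wq, wk, wv = (rng.standard_normal((16, 16)) * 0.1 for _ in range(3))
out, weights = relu_window_attention(x, wq, wk, wv, grid=4)
print(out.shape, weights.shape)
```

In this sketch the score matrix has shape (N, M) rather than (N, N), which is where the saving comes from; doubling the input resolution changes N but leaves M at `grid * grid`.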

Problem

Research questions and friction points this paper is trying to address.

Address the efficiency imbalance between CNNs and attention mechanisms

Improve lightweight SoftMax attention for long-sequence modeling

Propose adaptive feature map size for computational efficiency
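
To see why the size of the key set matters, the following back-of-the-envelope count compares the number of attention score entries for full self-attention against attention over a small pooled key grid. The function name `score_matrix_entries` and the 4×4 grid are hypothetical values chosen for illustration; the paper's actual window configuration is not stated in this summary.

```python
def score_matrix_entries(height, width, grid=None):
    """Count query-key score entries for one attention head.

    grid=None means every token is also a key (full self-attention);
    otherwise keys come from a grid x grid pooled feature map.
    """
    n_queries = height * width
    n_keys = n_queries if grid is None else grid * grid
    return n_queries * n_keys

for side in (32, 64):
    full = score_matrix_entries(side, side)
    pooled = score_matrix_entries(side, side, grid=4)
    print(side, full, pooled, full // pooled)
# 32 1048576 16384 64
# 64 16777216 65536 256
```

With full self-attention the count grows quadratically in the number of tokens, while with a fixed-size key grid it grows only linearly, so the gap widens as resolution increases.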

Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive lightweight SoftMax attention mechanism

ReLU simulating SoftMax in global attention

Lightweight hybrid backbone network LOLViT
