ReGLA: Efficient Receptive-Field Modeling with Gated Linear Attention Network

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of balancing accuracy and inference efficiency in lightweight models for high-resolution vision tasks, where Transformers often suffer from significant latency due to their high computational complexity. To this end, we propose ReGLA, an efficient hybrid CNN-attention architecture featuring three key components: the Efficient Large Receptive Field (ELRF) module, which improves convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module, which enables linear-complexity global modeling via ReLU-gated linear attention; and a multi-teacher knowledge distillation strategy that improves generalization. The ReGLA-M variant achieves 80.85% Top-1 accuracy on ImageNet-1K (224px) and requires only 4.98 ms for inference on 512px images. It also outperforms comparably sized iFormer models by 3.1% AP on COCO object detection and 3.6% mIoU on ADE20K semantic segmentation.

📝 Abstract
Balancing accuracy and latency on high-resolution images is a critical challenge for lightweight models, particularly for Transformer-based architectures that often suffer from excessive latency. To address this issue, we introduce \textbf{ReGLA}, a series of lightweight hybrid networks that integrate efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module, which enhances convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module, which maintains linear complexity while enhancing local feature representation; and a multi-teacher distillation strategy that boosts performance on downstream tasks. Extensive experiments validate the superiority of ReGLA; in particular, ReGLA-M achieves \textbf{80.85\%} Top-1 accuracy on ImageNet-1K at $224px$ with only \textbf{4.98 ms} latency at $512px$. Furthermore, ReGLA outperforms similarly scaled iFormer models on downstream tasks, achieving gains of \textbf{3.1\%} AP on COCO object detection and \textbf{3.6\%} mIoU on ADE20K semantic segmentation, establishing it as a state-of-the-art solution for high-resolution visual applications.
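The linear-complexity mechanism described in the abstract can be sketched as follows. The paper's exact RGMA formulation is not reproduced on this page, so the ReLU feature map, the normalization, and the sigmoid output gate below are all illustrative assumptions. The key point is associativity: computing φ(Q)(φ(K)ᵀV) instead of (φ(Q)φ(K)ᵀ)V makes the cost linear, rather than quadratic, in sequence length n.

```python
import numpy as np

def relu_gated_linear_attention(Q, K, V, G, eps=1e-6):
    """Hypothetical sketch of ReLU-gated linear attention.

    Q, K: (n, d) queries/keys; V: (n, d_v) values; G: (n, d_v) gate logits.
    Uses ReLU as the kernel feature map and a sigmoid elementwise gate;
    both choices are assumptions for illustration, not the paper's spec.
    """
    phi_q = np.maximum(Q, 0.0)      # ReLU feature map for queries
    phi_k = np.maximum(K, 0.0)      # ReLU feature map for keys
    kv = phi_k.T @ V                # (d, d_v): O(n * d * d_v), linear in n
    z = phi_k.sum(axis=0)           # (d,): normalizer accumulator
    num = phi_q @ kv                # (n, d_v)
    den = phi_q @ z + eps           # (n,): per-row normalization
    out = num / den[:, None]
    gate = 1.0 / (1.0 + np.exp(-G)) # sigmoid output gate
    return gate * out

rng = np.random.default_rng(0)
n, d, dv = 8, 4, 5
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
V, G = rng.normal(size=(n, dv)), rng.normal(size=(n, dv))
out = relu_gated_linear_attention(Q, K, V, G)
```

By associativity, this produces (up to floating-point error) the same result as the quadratic form, in which the full (n, n) similarity matrix φ(Q)φ(K)ᵀ is materialized and row-normalized, but without the O(n²) memory and compute.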
Problem

Research questions and friction points this paper is trying to address.

lightweight models
high-resolution images
accuracy-latency trade-off
Transformer-based architectures
computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gated Linear Attention
Efficient Large Receptive Field
ReLU Gated Modulated Attention
Multi-teacher Distillation
Lightweight Transformer
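The ELRF idea of enlarging the receptive field without a heavy kernel can be illustrated with the standard receptive-field recurrence for stacked convolutions. The layer configurations below are hypothetical examples, not the paper's actual ELRF design:

```python
def receptive_field(layers):
    """Receptive field of stacked conv layers, each given as
    (kernel_size, dilation, stride). Standard recurrence:
    rf += (k - 1) * dilation * jump, where jump is the product
    of the strides of all preceding layers.
    """
    rf, jump = 1, 1
    for k, dilation, stride in layers:
        rf += (k - 1) * dilation * jump
        jump *= stride
    return rf

# Three stacked 3-tap convs with dilations 1, 2, 4 (stride 1)
# match the receptive field of a single 15-tap kernel,
# while using far fewer weights per channel (3 * 3 = 9 vs. 15).
stacked = receptive_field([(3, 1, 1), (3, 2, 1), (3, 4, 1)])
single = receptive_field([(15, 1, 1)])
print(stacked, single)  # → 15 15
```

This is the general principle behind large-receptive-field convolutional designs; how ELRF specifically realizes it (dilation, kernel decomposition, or otherwise) is described in the paper itself.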
Junzhou Li
University of Science and Technology of China
Manqi Zhao
Google Inc.
Machine Learning, Data Mining
Yilin Gao
Shanghai University
Zhiheng Yu
Huawei Technologies Co., Ltd.
Yin Li
Huawei Technologies Co., Ltd.
Dongsheng Jiang
Huawei Technologies Co., Ltd.
Li Xiao
University of Science and Technology of China