🤖 AI Summary
To address the high computational and memory overhead that self-attention imposes on Transformer-based models for lightweight image super-resolution (SR), this paper proposes ConvAttn, a convolutionized self-attention module that jointly leverages a shared large-kernel convolution and dynamic convolutions for efficient long-range modeling and instance-aware weighting. It further presents the first large-scale deployment of Flash Attention (with window sizes up to 32×32) in lightweight SR, integrated into an end-to-end convolution–attention hybrid architecture. On Urban100×2, the method achieves a +0.31 dB PSNR gain while reducing inference latency and GPU memory consumption by 16× and 12.2×, respectively. On Urban100×4, it outperforms HiT-SRF by +0.27 dB PSNR with 3.7× lower latency and 6.2× less memory. The core contributions are (i) a highly efficient attention design tailored to resource-constrained SR, and (ii) a hardware-friendly, large-window Flash Attention implementation validated in practice.
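To make the large-window claim concrete, here is a minimal NumPy sketch of plain softmax attention computed independently inside non-overlapping windows. This is only the mathematical operation; the paper's Flash Attention deployment computes the same result tile-by-tile without materializing the full attention matrix, which is what makes 32×32 windows affordable. The function name and shapes below are illustrative assumptions, not the paper's API.

```python
import numpy as np

def softmax(a, axis=-1):
    # numerically stable softmax
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, win=4):
    """Softmax self-attention within non-overlapping win x win windows.
    x: (H, W, C) feature map, H and W divisible by win.
    Cost per window is O(win^4) in the naive form shown here, which is
    why large windows need a memory-efficient (flash-style) kernel."""
    H, W, C = x.shape
    # partition into (num_windows, win*win, C) token groups
    t = x.reshape(H // win, win, W // win, win, C).transpose(0, 2, 1, 3, 4)
    t = t.reshape(-1, win * win, C)
    scores = t @ t.transpose(0, 2, 1) / np.sqrt(C)   # (N, win^2, win^2)
    out = softmax(scores) @ t                        # (N, win^2, C)
    # reverse the window partition back to (H, W, C)
    out = out.reshape(H // win, W // win, win, win, C).transpose(0, 2, 1, 3, 4)
    return out.reshape(H, W, C)
```

Each output token is a convex combination of the tokens in its own window, so the output shape matches the input; only the window size controls the attention's spatial reach.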
📝 Abstract
In this paper, we tackle the high computational overhead of transformers for lightweight image super-resolution (SR). Motivated by the observation of self-attention's inter-layer repetition, we introduce a convolutionized self-attention module named Convolutional Attention (ConvAttn) that emulates self-attention's long-range modeling capability and instance-dependent weighting with a single shared large kernel and dynamic kernels. By utilizing the ConvAttn module, we significantly reduce the reliance on self-attention and its associated memory-bound operations while maintaining the representational capability of transformers. Furthermore, we overcome the challenge of integrating flash attention into the lightweight SR regime, effectively mitigating self-attention's inherent memory bottleneck. We scale the window size up to 32$\times$32 with flash attention rather than proposing an intricate self-attention module, significantly improving PSNR by 0.31 dB on Urban100$\times$2 while reducing latency and memory usage by 16$\times$ and 12.2$\times$. Building on these approaches, our proposed network, termed Emulating Self-attention with Convolution (ESC), notably improves PSNR by 0.27 dB on Urban100$\times$4 compared to HiT-SRF, reducing latency and memory usage by 3.7$\times$ and 6.2$\times$, respectively. Extensive experiments demonstrate that our ESC maintains the long-range modeling ability, data scalability, and representational power of transformers despite most self-attention layers being replaced by the ConvAttn module.
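The ConvAttn idea described above can be sketched in a few lines of NumPy: a single large kernel shared across channels supplies long-range aggregation, while small per-input (dynamic) kernels supply instance-dependent weighting. Everything below is a hypothetical illustration under our own assumptions; the function names, the additive combination, and the toy kernel predictor are ours, not the paper's actual module (which uses learned heads inside a full network).

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel 2-D correlation with zero 'same' padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    H, W = x.shape
    out = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def predict_dynamic_kernels(x, k=3):
    """Toy stand-in for a learned kernel predictor: scale a mean filter by
    per-channel pooled statistics so the kernels depend on the input."""
    C = x.shape[0]
    stats = x.reshape(C, -1).mean(axis=1, keepdims=True)   # (C, 1)
    base = np.ones((k, k)) / (k * k)                        # (k, k) mean filter
    return stats[:, :, None] * base[None]                   # (C, k, k)

def conv_attn(x, shared_kernel, dyn_kernels):
    """Sketch of the ConvAttn principle:
    - shared_kernel (K, K): one large kernel shared by all channels,
      emulating self-attention's long-range modeling;
    - dyn_kernels (C, k, k): small input-dependent kernels, emulating
      instance-dependent weighting.
    x: (C, H, W) feature map."""
    C = x.shape[0]
    long_range = np.stack([conv2d_same(x[c], shared_kernel) for c in range(C)])
    instance = np.stack([conv2d_same(x[c], dyn_kernels[c]) for c in range(C)])
    return long_range + instance
```

Because both branches are plain convolutions, the module avoids the memory-bound attention-matrix operations entirely while keeping a large receptive field (via the shared kernel) and input adaptivity (via the dynamic kernels).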