Attentive Convolution: Unifying the Expressivity of Self-Attention with Convolutional Efficiency

📅 2025-10-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the fundamental mechanisms underlying the superior expressiveness of self-attention (SA) over convolution, identifying **adaptive routing** and **lateral inhibition** as the key factors. To bridge this gap, we propose Attentive Convolution (ATConv): a novel convolutional operator that natively integrates dynamic weight allocation and competitive feature selection into a standard 3×3 convolution, enabling content-aware information flow while preserving linear computational complexity and approximating SA’s modeling capacity. ATConv is the first convolutional design to seamlessly embed both mechanisms, transcending conventional fixed-receptive-field paradigms. Experiments demonstrate that CNNs built with ATConv achieve 84.4% Top-1 accuracy on ImageNet-1K using only 27M parameters. When substituting SA in diffusion models, ATConv reduces FID by 0.15 and accelerates sampling, confirming its unique combination of high expressiveness and computational efficiency.

Technology Category

Application Category

📝 Abstract
Self-attention (SA) has become the cornerstone of modern vision backbones for its powerful expressivity over traditional Convolutions (Conv). However, its quadratic complexity remains a critical bottleneck for practical applications. Given that Conv offers linear complexity and strong visual priors, continuing efforts have been made to promote the renaissance of Conv. However, a persistent performance chasm remains, highlighting that these modernizations have not yet captured the intrinsic expressivity that defines SA. In this paper, we re-examine the design of the CNNs, directed by a key question: what principles give SA its edge over Conv? As a result, we reveal two fundamental insights that challenge the long-standing design intuitions in prior research (e.g., Receptive field). The two findings are: (1) extit{Adaptive routing}: SA dynamically regulates positional information flow according to semantic content, whereas Conv employs static kernels uniformly across all positions. (2) extit{Lateral inhibition}: SA induces score competition among token weighting, effectively suppressing redundancy and sharpening representations, whereas Conv filters lack such inhibitory dynamics and exhibit considerable redundancy. Based on this, we propose extit{Attentive Convolution} (ATConv), a principled reformulation of the convolutional operator that intrinsically injects these principles. Interestingly, with only $3 imes3$ kernels, ATConv consistently outperforms various SA mechanisms in fundamental vision tasks. Building on ATConv, we introduce AttNet, a CNN family that can attain extbf{84.4%} ImageNet-1K Top-1 accuracy with only 27M parameters. In diffusion-based image generation, replacing all SA with the proposed $3 imes 3$ ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 in 400k steps with faster sampling. Code is available at: github.com/price112/Attentive-Convolution.
Problem

Research questions and friction points this paper is trying to address.

Self-attention has quadratic complexity limiting practical applications
Convolutions lack adaptive routing and lateral inhibition mechanisms
Existing Conv modernizations fail to match self-attention expressivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

ATConv combines self-attention principles with convolutional efficiency
ATConv uses adaptive routing for dynamic information flow
ATConv employs lateral inhibition to reduce redundancy
🔎 Similar Papers
No similar papers found.
H
Hao Yu
Center for Machine Vision and Signal Analysis, University of Oulu, Finland
H
Haoyu Chen
Center for Machine Vision and Signal Analysis, University of Oulu, Finland
Y
Yan Jiang
Center for Machine Vision and Signal Analysis, University of Oulu, Finland
W
Wei Peng
Department of Psychiatry and Behavioral Sciences, Stanford University, USA
Zhaodong Sun
Zhaodong Sun
Nanjing University of Information Science and Technology (NUIST)
Biomedical Signal ProcessingrPPGMachine VisionImage and Video Processing
Samuel Kaski
Samuel Kaski
Director, ELLIS Institute Finland; Professor, Aalto University and University of Manchester
Probabilistic machine learningAI4ScienceCollaborative AI
Guoying Zhao
Guoying Zhao
Academy Professor, IEEE Fellow, Professor of Computer Science and Engineering, University of Oulu
Affective ComputingArtificial IntelligenceComputer VisionPattern Recognition