🤖 AI Summary
This work investigates the mechanisms underlying the superior expressiveness of self-attention (SA) over convolution, identifying **adaptive routing** and **lateral inhibition** as the key factors. To close this gap, we propose Attentive Convolution (ATConv): a convolutional operator that natively integrates dynamic weight allocation and competitive feature selection into a standard 3×3 convolution, enabling content-aware information flow while preserving linear computational complexity and approximating SA's modeling capacity. ATConv is the first convolutional design to embed both mechanisms, moving beyond conventional fixed-receptive-field paradigms. Experiments show that CNNs built with ATConv achieve 84.4% Top-1 accuracy on ImageNet-1K with only 27M parameters. When substituted for SA in diffusion models, ATConv reduces FID by 0.15 and accelerates sampling, combining high expressiveness with computational efficiency.
📝 Abstract
Self-attention (SA) has become the cornerstone of modern vision backbones owing to its powerful expressivity over traditional convolution (Conv). However, its quadratic complexity remains a critical bottleneck for practical applications. Given that Conv offers linear complexity and strong visual priors, continued efforts have been made to promote the renaissance of Conv. However, a persistent performance gap remains, indicating that these modernizations have not yet captured the intrinsic expressivity that defines SA. In this paper, we re-examine the design of CNNs, guided by a key question: what principles give SA its edge over Conv? We reveal two fundamental insights that challenge long-standing design intuitions in prior research (e.g., receptive field). The two findings are: (1) *Adaptive routing*: SA dynamically regulates positional information flow according to semantic content, whereas Conv applies static kernels uniformly across all positions. (2) *Lateral inhibition*: SA induces score competition among token weightings, effectively suppressing redundancy and sharpening representations, whereas Conv filters lack such inhibitory dynamics and exhibit considerable redundancy. Based on this, we propose *Attentive Convolution* (ATConv), a principled reformulation of the convolutional operator that intrinsically injects these principles. Interestingly, with only 3×3 kernels, ATConv consistently outperforms various SA mechanisms on fundamental vision tasks. Building on ATConv, we introduce AttNet, a CNN family that attains **84.4%** ImageNet-1K Top-1 accuracy with only 27M parameters. In diffusion-based image generation, replacing all SA with the proposed 3×3 ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 at 400k steps with faster sampling. Code is available at: github.com/price112/Attentive-Convolution.
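The two principles above can be sketched in code. The following is an illustrative toy implementation, *not* the paper's actual ATConv operator: the routing matrix `W_route`, the single-channel setting, and the softmax-over-taps formulation are assumptions made for clarity. It shows how per-position kernel weights can be generated from local content (adaptive routing) and forced to compete via a softmax over the 3×3 taps (lateral inhibition), while the cost stays linear in the number of positions.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_conv_sketch(x, W_route, k=3):
    """Toy sketch of the two principles on a single-channel map x (H, W).

    - Adaptive routing: the k*k kernel logits at each position are a
      linear function (W_route, a hypothetical learned matrix) of the
      local patch content, so the effective kernel varies per position.
    - Lateral inhibition: a softmax over the k*k taps makes them
      compete, suppressing redundant positions.

    Unlike a static Conv, the weights depend on the input; unlike SA,
    the cost is linear in H*W since each output only sees k*k taps.
    """
    H, W = x.shape
    p = k // 2
    xp = np.pad(x, p)                      # zero-pad so output keeps shape
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k].reshape(-1)  # k*k local taps
            logits = W_route @ patch                  # adaptive routing
            weights = softmax(logits, axis=0)         # lateral inhibition
            out[i, j] = weights @ patch               # content-aware mixing
    return out
```

A static 3×3 convolution corresponds to replacing `weights` with a fixed vector shared across all `(i, j)`; the sketch differs precisely in the two highlighted lines, which is where the content-adaptivity and competition enter.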