🤖 AI Summary
This work investigates the mechanisms underlying the superior expressiveness of self-attention (SA) over convolution, identifying **adaptive routing** and **lateral inhibition** as the key factors. To close this gap, we propose Attentive Convolution (ATConv): a convolutional operator that natively integrates dynamic weight allocation and competitive feature selection into a standard 3×3 convolution, enabling content-aware information flow while preserving linear computational complexity and approximating SA's modeling capacity. ATConv is the first convolutional design to embed both mechanisms, moving beyond conventional fixed-receptive-field paradigms. Experiments show that CNNs built with ATConv achieve 84.4% Top-1 accuracy on ImageNet-1K with only 27M parameters. When substituted for SA in diffusion models, ATConv reduces FID by 0.15 and accelerates sampling, combining high expressiveness with computational efficiency.
📝 Abstract
Self-attention (SA) has become the cornerstone of modern vision backbones owing to its powerful expressivity over traditional convolution (Conv). However, its quadratic complexity remains a critical bottleneck for practical applications. Given that Conv offers linear complexity and strong visual priors, continued efforts have been made to promote the renaissance of Conv. However, a persistent performance gap remains, indicating that these modernizations have not yet captured the intrinsic expressivity that defines SA. In this paper, we re-examine the design of CNNs, guided by a key question: what principles give SA its edge over Conv? We reveal two fundamental insights that challenge long-standing design intuitions in prior research (e.g., receptive field). The two findings are: (1) *Adaptive routing*: SA dynamically regulates positional information flow according to semantic content, whereas Conv applies static kernels uniformly across all positions. (2) *Lateral inhibition*: SA induces score competition among token weightings, effectively suppressing redundancy and sharpening representations, whereas Conv filters lack such inhibitory dynamics and exhibit considerable redundancy. Based on this, we propose *Attentive Convolution* (ATConv), a principled reformulation of the convolutional operator that intrinsically injects these principles. Interestingly, with only 3×3 kernels, ATConv consistently outperforms various SA mechanisms on fundamental vision tasks. Building on ATConv, we introduce AttNet, a CNN family that attains **84.4%** ImageNet-1K Top-1 accuracy with only 27M parameters. In diffusion-based image generation, replacing all SA with the proposed 3×3 ATConv in SiT-XL/2 reduces ImageNet FID by 0.15 at 400k steps with faster sampling. Code is available at: github.com/price112/Attentive-Convolution.
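The two principles above can be sketched in code. The following is an illustrative toy implementation, *not* the paper's actual ATConv operator: the routing matrix `W_route`, the single-channel setting, and the softmax-over-taps formulation are assumptions made for clarity. It shows how per-position kernel weights can be generated from local content (adaptive routing) and forced to compete via a softmax over the 3×3 taps (lateral inhibition), while the cost stays linear in the number of positions.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_conv_sketch(x, W_route, k=3):
    """Toy sketch of the two principles on a single-channel map x (H, W).

    - Adaptive routing: the k*k kernel logits at each position are a
      linear function (W_route, a hypothetical learned matrix) of the
      local patch content, so the effective kernel varies per position.
    - Lateral inhibition: a softmax over the k*k taps makes them
      compete, suppressing redundant positions.

    Unlike a static Conv, the weights depend on the input; unlike SA,
    the cost is linear in H*W since each output only sees k*k taps.
    """
    H, W = x.shape
    p = k // 2
    xp = np.pad(x, p)                      # zero-pad so output keeps shape
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k].reshape(-1)  # k*k local taps
            logits = W_route @ patch                  # adaptive routing
            weights = softmax(logits, axis=0)         # lateral inhibition
            out[i, j] = weights @ patch               # content-aware mixing
    return out
```

A static 3×3 convolution corresponds to replacing `weights` with a fixed vector shared across all `(i, j)`; the sketch differs precisely in the two highlighted lines, which is where the content-adaptivity and competition enter.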