FFNet: MetaMixer-based Efficient Convolutional Mixer Design

📅 2024-06-04
📈 Citations: 1
Influential: 0
🤖 AI Summary
Transformer-based vision backbones pay a high computational cost for self-attention, while the feed-forward network (FFN), though present in nearly every model, has received little design attention. Method: The paper proposes FFNification, which converts self-attention into an FFN-like, purely convolutional token mixer while retaining the query-key-value (QKV) framework: QKV interactions are replaced with large-kernel depthwise convolutions, and the GELU activation replaces softmax. The resulting FFNified attention serves as key-value memories for detecting locally distributed spatial patterns and operates in the opposite dimension to the ConvNeXt block within each QKV sub-operation. Contribution/Results: Building on these modules, the authors present the FFNet family and propose MetaMixer, a general mixer architecture that does not fix the sub-operations inside the QKV framework. Despite using only simple operators, FFNet outperforms sophisticated, highly specialized same-scale methods on image classification, detection, and segmentation with notable efficiency gains, supporting the hypothesis that the QKV framework itself, rather than any specific instantiation such as self-attention, is what drives competitive performance.

📝 Abstract
Transformer, composed of self-attention and Feed-Forward Network, has revolutionized the landscape of network design across various vision tasks. While self-attention is extensively explored as a key factor in performance, FFN has received little attention. FFN is a versatile operator seamlessly integrated into nearly all AI models to effectively harness rich representations. Recent works also show that FFN functions like key-value memories. Thus, akin to the query-key-value mechanism within self-attention, FFN can be viewed as a memory network, where the input serves as the query and the two projection weights operate as keys and values, respectively. Based on these observations, we hypothesize that the importance lies in the query-key-value framework itself for competitive performance. To verify this, we propose converting self-attention into a more FFN-like efficient token mixer with only convolutions while retaining the query-key-value framework, namely FFNification. Specifically, FFNification replaces query-key-value interactions with large kernel convolutions and adopts the GELU activation function instead of softmax. The derived token mixer, FFNified attention, serves as key-value memories for detecting locally distributed spatial patterns, and operates in the opposite dimension to the ConvNeXt block within each corresponding sub-operation of the query-key-value framework. Building upon the above two modules, we present a family of Fast-Forward Networks (FFNet). Despite being composed of only simple operators, FFNet outperforms sophisticated and highly specialized methods in each domain, with notable efficiency gains. These results validate our hypothesis, leading us to propose MetaMixer, a general mixer architecture that does not specify sub-operations within the query-key-value framework.
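To make the FFNification recipe concrete, here is a minimal single-channel, 1-D sketch of the FFNified attention token mixer described above: the input tokens act as the query, a large-kernel depthwise convolution plays the "key" role of detecting local patterns, GELU replaces softmax, and a second depthwise convolution plays the "value" role. All function names and kernel values are illustrative assumptions; the paper's actual mixer uses multi-channel 2-D depthwise convolutions inside a full backbone.

```python
import math

def gelu(x):
    # GELU activation (tanh approximation); FFNification uses it in place of softmax.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def depthwise_conv1d(seq, kernel):
    # Depthwise convolution over a 1-D token sequence with zero padding,
    # standing in for the large-kernel spatial mixing step (output length == input length).
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k)) for i in range(len(seq))]

def ffnified_attention_1d(query, key_kernel, value_kernel):
    # Query: the input tokens themselves (no attention-style linear projection).
    scores = depthwise_conv1d(query, key_kernel)      # "key" step: detect local patterns
    activated = [gelu(s) for s in scores]             # GELU instead of softmax
    return depthwise_conv1d(activated, value_kernel)  # "value" step: aggregate activations

tokens = [0.5, -1.0, 2.0, 0.0, 1.5]
out = ffnified_attention_1d(tokens,
                            key_kernel=[0.2, 0.6, 0.2],
                            value_kernel=[0.1, 0.8, 0.1])
print(out)  # mixed tokens, same length as the input sequence
```

Because every step is a plain convolution or pointwise activation, the mixer has no quadratic token-token interaction, which is the source of the efficiency gains the abstract reports.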
Problem

Research questions and friction points this paper is trying to address.

Self-attention dominates architecture research while FFN, despite its ubiquity, remains underexplored as a design component.
Evidence that FFN behaves like key-value memories raises the question of whether the query-key-value framework itself, rather than self-attention specifically, drives performance.
Existing efficient vision backbones rely on sophisticated, highly specialized operators with significant computational overhead.
Innovation

Methods, ideas, or system contributions that make the work stand out.

FFNification replaces self-attention with convolutions
Uses large kernel convolutions for query-key-value interactions
Introduces MetaMixer as a general mixer architecture