SCHEME: Scalable Channel Mixer for Vision Transformers

📅 2023-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) suffer from severe parameter and computational redundancy in channel-mixing modules (e.g., FFN/MLP), yet existing approaches inadequately address this issue. Method: We propose a synergistic architecture comprising block-diagonal MLPs and parameter-free Channel Covariance Attention (CCA). The block-diagonal structure enforces structured sparsity, enhancing scalability; CCA dynamically models inter-group channel dependencies during training to improve feature mixing, while incurring zero inference overhead as it is discarded at deployment. Contribution/Results: This is the first work to decouple scalability and efficiency in channel mixing. Our plug-and-play method consistently outperforms standard ViT baselines across image classification, detection, and segmentation—especially under low-FLOPS or small-model regimes—establishing a new Pareto-optimal frontier in accuracy–FLOPS–parameter count–throughput trade-offs.
📝 Abstract
Vision Transformers have achieved impressive performance in many vision tasks. While the token mixer or attention block has been studied in great detail, much less research has been devoted to the channel mixer or feature mixing block (FFN or MLP), which accounts for a significant portion of the model parameters and computation. In this work, we show that the dense MLP connections can be replaced with a block-diagonal MLP structure that supports larger expansion ratios by splitting the MLP features into groups. To improve the feature clusters formed by this structure, we propose a lightweight, parameter-free channel covariance attention (CCA) mechanism as a parallel branch during training. This enables gradual feature mixing across channel groups during training, with a contribution that decays to zero as training progresses to convergence. As a result, the CCA block can be discarded during inference, enhancing performance at no additional computational cost. The resulting *Scalable CHannEl MixEr* (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance by controlling the block-diagonal MLP structure. This is demonstrated through a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with different ViT backbones, consistently demonstrate substantial accuracy gains over existing designs, especially in lower complexity regimes. The SCHEMEformer family is shown to establish new Pareto frontiers for accuracy vs. FLOPS, accuracy vs. model size, and accuracy vs. throughput, especially for fast transformers of small size.
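To make the parameter savings of the block-diagonal structure concrete, here is a minimal numpy sketch. The function names and the 2-layer-FFN parameter counting are illustrative assumptions, not the paper's code; the idea is that splitting the `d` channels into `g` groups with an independent small MLP per group turns the dense weight matrices into block-diagonal ones, cutting MLP parameters by roughly a factor of `g` at the same expansion ratio.

```python
import numpy as np

def dense_mlp_params(d, expansion):
    # Standard ViT FFN: two dense layers, d -> expansion*d -> d (biases ignored)
    hidden = expansion * d
    return d * hidden + hidden * d

def block_diagonal_mlp_params(d, expansion, groups):
    # Split the d channels into `groups` groups; each group has its own
    # small MLP, so the full weight matrices are block diagonal.
    dg = d // groups
    hg = expansion * dg
    per_group = dg * hg + hg * dg
    return groups * per_group

def block_diag_forward(x, Ws1, Ws2):
    # x: (n_tokens, d). Apply each group's 2-layer MLP (ReLU between)
    # to its channel slice, then concatenate the outputs.
    chunks = np.split(x, len(Ws1), axis=-1)
    outs = [np.maximum(c @ W1, 0.0) @ W2 for c, W1, W2 in zip(chunks, Ws1, Ws2)]
    return np.concatenate(outs, axis=-1)

# Example: d=384, expansion 4, 4 groups -> 4x fewer MLP parameters
print(dense_mlp_params(384, 4))              # 1179648
print(block_diagonal_mlp_params(384, 4, 4))  # 294912
```

Because the per-group cost shrinks quadratically in the group width, the freed budget can fund a larger expansion ratio, which is the scalability knob the abstract refers to.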
Problem

Research questions and friction points this paper is trying to address.

Improving channel mixer efficiency in Vision Transformers
Reducing computation cost without sacrificing performance
Enhancing feature mixing across channel groups dynamically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse block diagonal MLP structure
Lightweight channel covariance attention mechanism
Plug-and-play scalable channel mixer design
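The second bullet, channel covariance attention, can be sketched as follows. This is a hedged approximation of what "parameter-free" channel attention could look like, not the paper's exact formulation: mixing weights are derived from the channel-by-channel covariance of the features themselves, so the branch introduces no learnable parameters and can be dropped at inference.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def channel_covariance_attention(x):
    # x: (n_tokens, d_channels). Attention weights come from the channel
    # covariance of the input, so the branch has zero learnable parameters.
    xc = x - x.mean(axis=0, keepdims=True)   # center each channel
    cov = xc.T @ xc / x.shape[0]             # (d, d) channel covariance
    attn = softmax(cov, axis=-1)             # row-normalized mixing weights
    return x @ attn.T                        # mix information across channels
```

During training this branch would run in parallel with the block-diagonal MLP, letting correlated channels in different groups exchange information; once its contribution has decayed to zero, the branch is discarded, leaving inference cost unchanged.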