🤖 AI Summary
To address the high computational cost and large parameter count of vision transformers (ViTs), this paper proposes KCR-Transformer, a model compression method based on differentiable channel selection. Its core innovation is a theoretically grounded channel pruning mechanism guided by a sharp generalization bound, enabling joint optimization of the input and output channels of MLP layers. This allows a controllable reduction of FLOPs and parameters while preserving generalization performance. Crucially, KCR-Transformer is fully compatible with mainstream architectures—including ViT and Swin—without modifying the self-attention modules. Extensive experiments on image classification and object detection demonstrate that KCR-Transformer achieves higher accuracy than the original models while reducing FLOPs by 35% and parameters by 42% on average. These results validate the effectiveness and architectural generality of theory-driven structural compression for vision transformers.
📝 Abstract
Self-attention and transformer architectures have become foundational components in modern deep learning. Recent efforts have integrated transformer blocks into compact neural architectures for computer vision, giving rise to various efficient vision transformers. In this work, we introduce Transformer with Kernel Complexity Reduction, or KCR-Transformer, a compact transformer block equipped with differentiable channel selection, guided by a novel and sharp theoretical generalization bound. KCR-Transformer performs input/output channel selection in the MLP layers of transformer blocks to reduce the computational cost. Furthermore, we provide a rigorous theoretical analysis establishing a tight generalization bound for networks equipped with KCR-Transformer blocks. Leveraging such strong theoretical results, the channel pruning by KCR-Transformer is conducted in a generalization-aware manner, ensuring that the resulting network retains a provably small generalization error. Our KCR-Transformer is compatible with many popular and compact transformer networks, such as ViT and Swin, and it reduces the FLOPs of vision transformers while maintaining or even improving the prediction accuracy. In the experiments, we replace all the transformer blocks in the vision transformers with KCR-Transformer blocks, leading to KCR-Transformer networks with different backbones. The resulting KCR-Transformer networks achieve superior performance on various computer vision tasks, matching or surpassing the original models with fewer FLOPs and parameters.
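To make the core idea concrete, below is a minimal NumPy sketch of differentiable channel selection in an MLP block via per-channel learnable gates, which can later be thresholded to physically prune hidden channels. This is an illustrative assumption about the general mechanism; the paper's actual generalization-bound-guided selection criterion, and the exact placement of the gates, are not reproduced here. All names (`GatedMLP`, `prune`, etc.) are hypothetical.

```python
import numpy as np

class GatedMLP:
    """Hypothetical sketch: a transformer-style MLP whose hidden channels
    carry differentiable sigmoid gates; low-gate channels can be pruned."""

    def __init__(self, d_model=8, d_hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((d_model, d_hidden)) * 0.1
        self.w2 = rng.standard_normal((d_hidden, d_model)) * 0.1
        # One learnable gate logit per hidden channel; during training these
        # would be optimized jointly with the weights (not shown here).
        self.gate_logits = rng.standard_normal(d_hidden)

    def gates(self):
        # Sigmoid keeps gates in (0, 1) and differentiable w.r.t. the logits.
        return 1.0 / (1.0 + np.exp(-self.gate_logits))

    def forward(self, x):
        h = np.maximum(x @ self.w1, 0.0)     # ReLU hidden activations
        return (h * self.gates()) @ self.w2  # each channel scaled by its gate

    def prune(self, threshold=0.5):
        """Drop hidden channels whose gate is below the threshold.

        Since gates are positive, ReLU(a) * g == ReLU(a * g), so the kept
        gates can be folded into w1 and removed from the forward pass.
        """
        g = self.gates()
        keep = g > threshold
        self.w1 = (self.w1 * g)[:, keep]
        self.w2 = self.w2[keep, :]
        self.gate_logits = None  # gates are now baked into the weights
        return int(keep.sum())   # number of surviving hidden channels

    def forward_pruned(self, x):
        return np.maximum(x @ self.w1, 0.0) @ self.w2
```

A quick usage check: pruning with a threshold of 0 keeps every channel, so the pruned forward pass reproduces the gated one exactly; raising the threshold trades accuracy for fewer FLOPs and parameters, which is the trade-off the paper's bound is designed to control.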