TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition

📅 2023-10-30

🏛️ arXiv.org

📈 Citations: 20

✨ Influential: 1

🤖 AI Summary

To address the limited representational capacity of conventional static convolution in hybrid CNN-Transformer architectures, this paper proposes the Dual Dynamic Token Mixer (D-Mixer), a lightweight, input-adaptive token mixer. D-Mixer applies efficient global attention and input-dependent depthwise convolution to evenly split feature segments, so that long-range dependencies and local details are both modeled dynamically. This gives the network a strong inductive bias and an enlarged effective receptive field, and lets features produced by self-attention influence the convolution kernels of later layers. Built on D-Mixer, TransXNet-T surpasses Swin-T by 0.3% top-1 accuracy on ImageNet-1K at less than half the computational cost; TransXNet-S and TransXNet-B reach 83.8% and 84.6% top-1 accuracy, respectively. TransXNet also outperforms other state-of-the-art networks on dense prediction tasks such as semantic segmentation and object detection, at lower computational cost.

📝 Abstract

Recent studies have integrated convolution into transformers to introduce inductive bias and improve generalization performance. However, the static nature of conventional convolution prevents it from dynamically adapting to input variations, resulting in a representation discrepancy between convolution and self-attention as self-attention calculates attention matrices dynamically. Furthermore, when stacking token mixers that consist of convolution and self-attention to form a deep network, the static nature of convolution hinders the fusion of features previously generated by self-attention into convolution kernels. These two limitations result in a sub-optimal representation capacity of the constructed networks. To find a solution, we propose a lightweight Dual Dynamic Token Mixer (D-Mixer) that aggregates global information and local details in an input-dependent way. D-Mixer works by applying an efficient global attention module and an input-dependent depthwise convolution separately on evenly split feature segments, endowing the network with strong inductive bias and an enlarged effective receptive field. We use D-Mixer as the basic building block to design TransXNet, a novel hybrid CNN-Transformer vision backbone network that delivers compelling performance. In the ImageNet-1K image classification task, TransXNet-T surpasses Swin-T by 0.3% in top-1 accuracy while requiring less than half of the computational cost. Furthermore, TransXNet-S and TransXNet-B exhibit excellent model scalability, achieving top-1 accuracy of 83.8% and 84.6% respectively, with reasonable computational costs. Additionally, our proposed network architecture demonstrates strong generalization capabilities in various dense prediction tasks, outperforming other state-of-the-art networks while having lower computational costs. Code is available at https://github.com/LMMMEng/TransXNet.
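The split-and-mix idea described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation (see the linked repository for that). It works on a 1D token sequence rather than a 2D feature map, uses single-head attention with identity query/key/value projections, and makes the depthwise kernel input-dependent by softmax-mixing a small bank of candidate kernels according to each channel's mean activation. The names `global_attention`, `dynamic_depthwise_conv`, `d_mixer`, `kernel_bank`, and `gate` are all hypothetical and chosen only for this illustration.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(tokens):
    """Single-head self-attention; Q, K, V are the tokens themselves (simplification)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(dim)])
    return out


def dynamic_depthwise_conv(tokens, kernel_bank, gate):
    """Depthwise 1D convolution whose per-channel kernel depends on the input.

    kernel_bank[g][c] is candidate kernel g for channel c (hypothetical parameters).
    gate[g][c] scales the channel mean into a score for candidate g (hypothetical).
    """
    n, dim = len(tokens), len(tokens[0])
    ksize = len(kernel_bank[0][0])
    pad = ksize // 2
    out = [[0.0] * dim for _ in range(n)]
    for c in range(dim):
        mean_c = sum(t[c] for t in tokens) / n
        # Input-dependent mixing weights over the candidate kernels.
        mix = softmax([g[c] * mean_c for g in gate])
        kernel = [sum(m * bank[c][j] for m, bank in zip(mix, kernel_bank))
                  for j in range(ksize)]
        for i in range(n):
            acc = 0.0
            for j in range(ksize):
                src = i + j - pad
                if 0 <= src < n:  # zero padding at the sequence borders
                    acc += kernel[j] * tokens[src][c]
            out[i][c] = acc
    return out


def d_mixer(tokens, kernel_bank, gate):
    """Split channels evenly, mix each half separately, then concatenate."""
    half = len(tokens[0]) // 2
    global_part = global_attention([t[:half] for t in tokens])
    local_part = dynamic_depthwise_conv([t[half:] for t in tokens], kernel_bank, gate)
    return [g + l for g, l in zip(global_part, local_part)]


rng = random.Random(0)
n_tokens, dim, n_candidates, ksize = 6, 8, 2, 3
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]
kernel_bank = [[[rng.uniform(-1, 1) for _ in range(ksize)] for _ in range(dim // 2)]
               for _ in range(n_candidates)]
gate = [[rng.uniform(-1, 1) for _ in range(dim // 2)] for _ in range(n_candidates)]
mixed = d_mixer(tokens, kernel_bank, gate)
print(len(mixed), len(mixed[0]))  # 6 8
```

The output keeps the input shape: the first half of the channels carries globally mixed features and the second half carries locally mixed ones. Because the convolution kernel is recomputed from the input, the local branch is not a fixed linear filter, which is the property the abstract contrasts with static convolution.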

Problem

Research questions and friction points this paper is trying to address.

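Static convolution in hybrid CNN-Transformer networks cannot adapt to input variations, which creates a representation discrepancy with dynamically computed self-attention.

When convolution and self-attention token mixers are stacked, static kernels cannot incorporate features previously generated by self-attention.

Together, these limitations leave the representation capacity of such networks sub-optimal; the goal is higher accuracy at lower computational cost.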

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual Dynamic Token Mixer for global and local dynamics

Efficient global attention and input-dependent convolution

Hybrid CNN-Transformer backbone with enhanced performance
