EVCC: Enhanced Vision Transformer-ConvNeXt-CoAtNet Fusion for Classification

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of Vision Transformers (ViTs) and the limited representational capacity of lightweight models in image classification, this paper proposes EVCC, a multi-branch hybrid architecture integrating ViT, a lightweight ConvNeXt, and CoAtNet. Its key contributions are: (1) information-preserving adaptive token pruning that dynamically compresses redundant visual tokens; (2) gated bidirectional cross-attention that strengthens inter-branch feature interaction; and (3) a context-aware dynamic router coupled with multi-task auxiliary classification heads that enables collaborative branch optimization. Evaluated on CIFAR-100, Tobacco3482, CelebA, and Brain Cancer, EVCC achieves state-of-the-art accuracy, improving by up to 2.0 percentage points while reducing FLOPs by 25%–35%, demonstrating a superior efficiency–accuracy trade-off.
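The "information-preserving adaptive token pruning" idea can be sketched roughly as follows. The selection score, keep ratio, and score-weighted summary token below are all illustrative assumptions; the summary above does not specify the paper's exact pruning criterion.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-k tokens by importance score and fuse the pruned
    remainder into a single summary token, so information from dropped
    tokens is not discarded outright.

    tokens: (N, D) token embeddings; scores: (N,) importance per token.
    NOTE: hypothetical sketch, not the paper's implementation.
    """
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    order = np.argsort(scores)[::-1]        # highest-scoring tokens first
    kept = tokens[order[:k]]
    dropped = tokens[order[k:]]
    if dropped.size:
        # Score-weighted average of pruned tokens ("information preservation")
        w = scores[order[k:]]
        w = w / (w.sum() + 1e-9)
        summary = (w[:, None] * dropped).sum(axis=0, keepdims=True)
        kept = np.concatenate([kept, summary], axis=0)
    return kept
```

With a 50% keep ratio, 8 input tokens reduce to 4 kept tokens plus 1 summary token, so downstream attention runs on 5 tokens instead of 8.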

📝 Abstract
Hybrid vision architectures combining Transformers and CNNs have significantly advanced image classification, but usually at substantial computational cost. We introduce EVCC (Enhanced Vision Transformer-ConvNeXt-CoAtNet), a novel multi-branch architecture integrating the Vision Transformer, lightweight ConvNeXt, and CoAtNet through key innovations: (1) adaptive token pruning with information preservation, (2) gated bidirectional cross-attention for enhanced feature refinement, (3) auxiliary classification heads for multi-task learning, and (4) a dynamic router gate employing context-aware confidence-driven weighting. Experiments across the CIFAR-100, Tobacco3482, CelebA, and Brain Cancer datasets demonstrate EVCC's superiority over powerful models like DeiT-Base, MaxViT-Base, and CrossViT-Base, consistently achieving state-of-the-art accuracy with improvements of up to 2 percentage points while reducing FLOPs by 25–35%. Our adaptive architecture adjusts computational demands to deployment needs by dynamically reducing token count, efficiently balancing the accuracy-efficiency trade-off while combining global context, local details, and hierarchical features for real-world applications. The source code of our implementation is available at https://anonymous.4open.science/r/EVCC.
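The gated bidirectional cross-attention of innovation (2) can be sketched as below: each branch queries the other's features, and a sigmoid gate decides per token how much of the cross-branch message to mix back in. The single-head formulation and fixed gate weight vectors are simplifying assumptions for illustration; the paper's actual layer is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_attend(q_feats, kv_feats):
    """Single-head cross-attention: queries come from one branch,
    keys/values from the other. q_feats: (Nq, D), kv_feats: (Nk, D)."""
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d))
    return attn @ kv_feats

def gated_bidirectional_cross_attention(a, b, gate_w_a, gate_w_b):
    """Both directions at once; a per-token sigmoid gate in [0, 1]
    (here driven by fixed weight vectors, an illustrative stand-in
    for learned parameters) controls the residual mix."""
    msg_a = cross_attend(a, b)               # branch A attends to B
    msg_b = cross_attend(b, a)               # branch B attends to A
    g_a = sigmoid(a @ gate_w_a)[:, None]
    g_b = sigmoid(b @ gate_w_b)[:, None]
    return a + g_a * msg_a, b + g_b * msg_b
```

Token counts per branch need not match: a ViT branch with 4 tokens can exchange messages with a ConvNeXt branch carrying 6, and each keeps its own sequence length.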
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs in hybrid vision architectures for image classification
Enhancing feature refinement through cross-attention and adaptive token pruning
Balancing accuracy and efficiency trade-offs in real-world vision applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive token pruning preserves information efficiently
Gated bidirectional cross-attention refines features effectively
Dynamic router gate enables context-aware confidence weighting
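The confidence-driven routing in the last bullet can be illustrated with a simple proxy: weight each branch's prediction by how peaked its softmax is. The max-probability confidence measure and temperature are assumptions of this sketch, not the paper's exact router.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(branch_logits, temperature=1.0):
    """Fuse per-branch class logits with confidence-driven weights:
    a branch whose softmax is more peaked (higher max probability)
    receives a larger fusion weight. Illustrative proxy only."""
    probs = [softmax(l) for l in branch_logits]
    conf = np.array([p.max() for p in probs])   # confidence per branch
    weights = softmax(conf / temperature)       # normalized fusion weights
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused, weights
```

A confidently peaked branch (logits like [2.0, 0.5, 0.1]) outweighs a near-uniform one, while the fused output remains a valid probability distribution.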
Kazi Reyazul Hasan
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Md Nafiu Rahman
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Wasif Jalal
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Sadif Ahmed
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Shahriar Raj
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Mubasshira Musarrat
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Muhammad Abdullah Adnan
Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh
Cloud Computing · Distributed Computing · Distributed Machine Learning · Artificial Intelligence · NLP