EVCC: Enhanced Vision Transformer-ConvNeXt-CoAtNet Fusion for Classification

📅 2025-11-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high computational cost of Vision Transformers (ViTs) and the limited representational capacity of lightweight models in image classification, this paper proposes EVCC, a multi-branch hybrid architecture integrating ViT, a lightweight ConvNeXt, and CoAtNet. Its key contributions are: (1) information-preserving adaptive token pruning that dynamically compresses redundant visual tokens; (2) gated bidirectional cross-attention that strengthens inter-branch feature interaction; and (3) a context-aware dynamic router coupled with multi-task auxiliary classification heads that enables collaborative branch optimization. Evaluated on CIFAR-100, Tobacco3482, CelebA, and Brain Cancer, EVCC achieves state-of-the-art accuracy, improving by up to 2.0 percentage points while reducing FLOPs by 25%–35%, demonstrating a superior efficiency–accuracy trade-off.
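The "information-preserving adaptive token pruning" idea can be sketched roughly as follows. The selection score, keep ratio, and score-weighted summary token below are all illustrative assumptions; the summary above does not specify the paper's exact pruning criterion.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-k tokens by importance score and fuse the pruned
    remainder into a single summary token, so information from dropped
    tokens is not discarded outright.

    tokens: (N, D) token embeddings; scores: (N,) importance per token.
    NOTE: hypothetical sketch, not the paper's implementation.
    """
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    order = np.argsort(scores)[::-1]        # highest-scoring tokens first
    kept = tokens[order[:k]]
    dropped = tokens[order[k:]]
    if dropped.size:
        # Score-weighted average of pruned tokens ("information preservation")
        w = scores[order[k:]]
        w = w / (w.sum() + 1e-9)
        summary = (w[:, None] * dropped).sum(axis=0, keepdims=True)
        kept = np.concatenate([kept, summary], axis=0)
    return kept
```

With a 50% keep ratio, 8 input tokens reduce to 4 kept tokens plus 1 summary token, so downstream attention runs on 5 tokens instead of 8.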

📝 Abstract
Hybrid vision architectures combining Transformers and CNNs have significantly advanced image classification, but usually at substantial computational cost. We introduce EVCC (Enhanced Vision Transformer-ConvNeXt-CoAtNet), a novel multi-branch architecture integrating the Vision Transformer, lightweight ConvNeXt, and CoAtNet through key innovations: (1) adaptive token pruning with information preservation, (2) gated bidirectional cross-attention for enhanced feature refinement, (3) auxiliary classification heads for multi-task learning, and (4) a dynamic router gate employing context-aware confidence-driven weighting. Experiments across the CIFAR-100, Tobacco3482, CelebA, and Brain Cancer datasets demonstrate EVCC's superiority over powerful models like DeiT-Base, MaxViT-Base, and CrossViT-Base, consistently achieving state-of-the-art accuracy with improvements of up to 2 percentage points while reducing FLOPs by 25–35%. Our adaptive architecture adjusts computational demands to deployment needs by dynamically reducing token count, efficiently balancing the accuracy-efficiency trade-off while combining global context, local details, and hierarchical features for real-world applications. The source code of our implementation is available at https://anonymous.4open.science/r/EVCC.
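The gated bidirectional cross-attention of innovation (2) can be sketched as below: each branch queries the other's features, and a sigmoid gate decides per token how much of the cross-branch message to mix back in. The single-head formulation and fixed gate weight vectors are simplifying assumptions for illustration; the paper's actual layer is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_attend(q_feats, kv_feats):
    """Single-head cross-attention: queries come from one branch,
    keys/values from the other. q_feats: (Nq, D), kv_feats: (Nk, D)."""
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d))
    return attn @ kv_feats

def gated_bidirectional_cross_attention(a, b, gate_w_a, gate_w_b):
    """Both directions at once; a per-token sigmoid gate in [0, 1]
    (here driven by fixed weight vectors, an illustrative stand-in
    for learned parameters) controls the residual mix."""
    msg_a = cross_attend(a, b)               # branch A attends to B
    msg_b = cross_attend(b, a)               # branch B attends to A
    g_a = sigmoid(a @ gate_w_a)[:, None]
    g_b = sigmoid(b @ gate_w_b)[:, None]
    return a + g_a * msg_a, b + g_b * msg_b
```

Token counts per branch need not match: a ViT branch with 4 tokens can exchange messages with a ConvNeXt branch carrying 6, and each keeps its own sequence length.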
Problem

Research questions and friction points this paper is trying to address.

Reducing computational costs in hybrid vision architectures for image classification
Enhancing feature refinement through cross-attention and adaptive token pruning
Balancing accuracy and efficiency trade-offs in real-world vision applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive token pruning preserves information efficiently
Gated bidirectional cross-attention refines features effectively
Dynamic router gate enables context-aware confidence weighting
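The confidence-driven routing in the last bullet can be illustrated with a simple proxy: weight each branch's prediction by how peaked its softmax is. The max-probability confidence measure and temperature are assumptions of this sketch, not the paper's exact router.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(branch_logits, temperature=1.0):
    """Fuse per-branch class logits with confidence-driven weights:
    a branch whose softmax is more peaked (higher max probability)
    receives a larger fusion weight. Illustrative proxy only."""
    probs = [softmax(l) for l in branch_logits]
    conf = np.array([p.max() for p in probs])   # confidence per branch
    weights = softmax(conf / temperature)       # normalized fusion weights
    fused = sum(w * p for w, p in zip(weights, probs))
    return fused, weights
```

A confidently peaked branch (logits like [2.0, 0.5, 0.1]) outweighs a near-uniform one, while the fused output remains a valid probability distribution.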
Kazi Reyazul Hasan
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Md Nafiu Rahman
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Wasif Jalal
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Sadif Ahmed
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Shahriar Raj
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Mubasshira Musarrat
Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology (BUET)
Muhammad Abdullah Adnan
Bangladesh University of Engineering and Technology (BUET), Dhaka, Bangladesh
Cloud Computing · Distributed Computing · Distributed Machine Learning · Artificial Intelligence · NLP