CoCAViT: Compact Vision Transformer with Robust Global Coordination

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing compact vision Transformers exhibit weak generalization and insufficient robustness on out-of-distribution (OOD) data. To address this, we propose CoCAViT, a compact vision Transformer featuring a strong holistic-local coordination mechanism. It restores efficient global modeling within a pure windowed-attention framework via domain-aware dynamic global tokens and Coordinator-patch Cross Attention (CoCA). Additionally, it incorporates a lightweight global token design and multi-scale feature fusion. At 224×224 resolution, CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, 52.2 mAP on COCO object detection, and 51.3 mIoU on ADE20K semantic segmentation, demonstrating competitive performance with low latency. Crucially, it significantly improves cross-domain robustness under distribution shifts, outperforming prior compact Transformers in OOD generalization while maintaining architectural efficiency.

📝 Abstract
In recent years, large-scale visual backbones have demonstrated remarkable capabilities in learning general-purpose features from images via extensive pre-training. Concurrently, many efficient architectures have emerged whose performance is comparable to that of larger models on in-domain benchmarks. However, we observe that for smaller models the performance drop on out-of-distribution (OOD) data is disproportionately large, indicating a deficiency in the generalization ability of existing efficient models. To address this, we identify key architectural bottlenecks and inappropriate design choices that contribute to this issue, and revise them to restore robustness in smaller models. To recover the global receptive field lost under pure window attention, we further introduce a Coordinator-patch Cross Attention (CoCA) mechanism, featuring dynamic, domain-aware global tokens that enhance local-global feature modeling and adaptively capture robust patterns across domains with minimal computational overhead. Integrating these advancements, we present CoCAViT, a novel visual backbone designed for robust real-time visual representation. Extensive experiments empirically validate our design. At a resolution of 224×224, CoCAViT-28M achieves 84.0% top-1 accuracy on ImageNet-1K, with significant gains on multiple OOD benchmarks compared to competing models. It also attains 52.2 mAP on COCO object detection and 51.3 mIoU on ADE20K semantic segmentation, while maintaining low latency.
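The core idea of coordinator-patch cross attention can be illustrated with a small NumPy sketch: a handful of global "coordinator" tokens first attend over all patch tokens to summarize the image, then the patch tokens attend back to the coordinators to pick up global context, all without full patch-to-patch attention. This is a minimal illustration under assumed shapes (49 patch tokens, 4 coordinator tokens, dimension 16), not the authors' implementation; the paper's version is dynamic and domain-aware, which is omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: queries attend over keys/values
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores) @ values

rng = np.random.default_rng(0)
d = 16
patches = rng.normal(size=(49, d))  # local patch tokens (e.g. one 7x7 window)
coords = rng.normal(size=(4, d))    # hypothetical global coordinator tokens

# Step 1: coordinators gather a global summary from all patches
coords_updated = coords + cross_attention(coords, patches, patches)

# Step 2: patches read the global context back from the coordinators
patches_updated = patches + cross_attention(patches, coords_updated, coords_updated)

print(patches_updated.shape)  # (49, 16)
```

The cost is O(n·g) for n patches and g coordinator tokens (g ≪ n), versus O(n²) for full self-attention, which is how a windowed backbone can regain a global field at minimal overhead.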
Problem

Research questions and friction points this paper is trying to address.

Address performance drop in small models on OOD data
Improve generalization in efficient vision architectures
Enhance local-global feature modeling with minimal overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact Vision Transformer with robust design
Coordinator-patch Cross Attention for global modeling
Dynamic domain-aware tokens enhance feature adaptability