🤖 AI Summary
The self-attention in Vision Transformers has a computational cost that grows quadratically with the number of image patches, which becomes a bottleneck on high-resolution images. To address this, the work proposes VECA (Visual Elastic Core Attention), an architecture that introduces a small set of learnable core tokens as communication mediators. Image patches interact exclusively through these cores, which gives visual representation learning with complexity linear in the number of patches. Departing from the conventional assumption that direct patch-to-patch interaction is required, VECA uses an elastic core-periphery attention mechanism that retains and updates all input tokens while allowing a flexible trade-off between computational cost and accuracy at inference time. Experiments show that VECA is competitive with recent vision foundation models on both image classification and dense prediction tasks while reducing computational cost.
📝 Abstract
Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with the number of patch tokens, and therefore grows rapidly with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a Vision Transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches interact directly only with a resolution-invariant set of $C$ learned "core" embeddings, the attention cost is $O(NC)$, which is linear in $N$ for a predetermined $C$ and bypasses the $O(N^2)$ scaling of full self-attention. Unlike prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.
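To make the core-periphery idea concrete, here is a minimal NumPy sketch of one such layer. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the layer layout, so the gather-then-broadcast order, the residual connections, the omission of learned projections, normalization and MLPs, and the names `core_periphery_layer`, `cross_attention`, `attention_score_count`, and `num_cores` are all hypothetical. The `num_cores` argument mimics the elastic trade-off by using only a prefix of the cores, which is one plausible reading of "nested training along the core axis".

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    # Single-head attention; learned Q/K/V projections are omitted for brevity.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)  # shape (num_queries, num_keys)
    return softmax(scores, axis=-1) @ keys_values  # shape (num_queries, d)


def core_periphery_layer(patches, cores, num_cores=None):
    # Hypothetical layer: patches never attend to other patches directly.
    # ASSUMPTION: using a prefix of the cores stands in for the elastic setting.
    active = cores if num_cores is None else cores[:num_cores]
    # Gather step: each core reads from all N patches (a C x N score matrix).
    new_cores = active + cross_attention(active, patches)
    # Broadcast step: each patch reads from the C cores (an N x C score matrix).
    new_patches = patches + cross_attention(patches, new_cores)
    return new_patches, new_cores


def attention_score_count(n, c):
    # Scores computed per layer in this sketch: C*N (gather) + N*C (broadcast).
    return 2 * n * c


rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 32))  # N = 196 patch tokens, width 32
cores = rng.standard_normal((8, 32))      # C = 8 core tokens (random here, learned in practice)

# Run with only the first 4 cores to emulate a cheaper inference setting.
out_patches, out_cores = core_periphery_layer(patches, cores, num_cores=4)
```

In this sketch all $N$ patch tokens are kept and updated at every layer, and the number of attention scores is $2NC$, so quadrupling $N$ at fixed $C$ quadruples the cost, whereas full self-attention with $N^2$ scores would grow sixteenfold.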