HAViT: Historical Attention Vision Transformer

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation in Vision Transformers where attention mechanisms across layers operate independently, hindering cross-layer information flow and feature refinement. To overcome this, the authors propose a cross-layer historical attention propagation method that stores and adaptively fuses historical attention matrices within the encoder using a fixed weighting coefficient (α = 0.45), enabling progressive optimization of attention patterns across layers. Notably, this approach introduces historical attention fusion into the original architecture with minimal modifications and reveals that random initialization of the historical component yields better performance than zero initialization. Experimental results demonstrate consistent improvements, with accuracy gains of 1.33% on CIFAR-100 and 1.25% on TinyImageNet, and similar enhancements are observed across variants such as CaiT.

📝 Abstract
Vision Transformers have excelled in computer vision but their attention mechanisms operate independently across layers, limiting information flow and feature learning. We propose an effective cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach enables progressive refinement of attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Comprehensive experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT performance increasing from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing 1.01% enhancement. Systematic analysis identifies the blending hyperparameter of historical attention (alpha = 0.45) as optimal across all configurations, providing the ideal balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at https://github.com/banik-s/HAViT.
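The core mechanism described above is a convex blend of each layer's attention map with a running "historical" attention carried forward from earlier layers, with the history randomly initialized. The sketch below illustrates that idea in plain numpy; the function names and shapes are illustrative assumptions, not the authors' implementation (see the linked repository for that), and only α = 0.45 and the random initialization come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def blended_attention_stack(q_layers, k_layers, alpha=0.45, seed=0):
    """Sketch of cross-layer historical attention blending.

    At each layer the raw attention map is fused with the running
    historical attention:

        A_fused = alpha * A_hist + (1 - alpha) * A_curr

    The history starts from random values (row-normalized), which the
    paper reports outperforms zero initialization. Single-head,
    per-sample case for clarity.
    """
    rng = np.random.default_rng(seed)
    n = q_layers[0].shape[0]
    # Random, row-stochastic initialization of the historical attention.
    hist = softmax(rng.standard_normal((n, n)))
    fused_maps = []
    for q, k in zip(q_layers, k_layers):
        d = q.shape[-1]
        # Standard scaled dot-product attention for the current layer.
        curr = softmax(q @ k.T / np.sqrt(d))
        # Convex blend with history; rows stay normalized because a
        # convex combination of row-stochastic matrices is row-stochastic.
        fused = alpha * hist + (1 - alpha) * curr
        hist = fused  # carry the fused map forward to the next layer
        fused_maps.append(fused)
    return fused_maps
```

In a real ViT encoder, `fused` would multiply the value matrix at each layer; storing only the previous fused map keeps the memory overhead to a single attention matrix per head, which matches the paper's claim of minimal architectural change.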
Problem

Research questions and friction points this paper is trying to address.

Vision Transformers
attention mechanism
cross-layer information flow
feature learning
historical attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Historical Attention
Vision Transformer
Cross-layer Attention Propagation
Attention Blending
Feature Refinement
Swarnendu Banik
Computer Vision and Biometrics Lab, Indian Institute of Information Technology, Allahabad
Manish Das
Computer Vision and Biometrics Lab, Indian Institute of Information Technology, Allahabad
Shiv Ram Dubey
Associate Professor, Indian Institute of Information Technology (IIIT), Allahabad, Prayagraj, BHARAT
Computer Vision · Deep Learning · Biometrics · Medical Imaging
Satish Kumar Singh
Computer Vision and Biometrics Lab, Indian Institute of Information Technology, Allahabad