HAViT: Historical Attention Vision Transformer

📅 2026-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation in Vision Transformers where attention mechanisms across layers operate independently, hindering cross-layer information flow and feature refinement. To overcome this, the authors propose a cross-layer historical attention propagation method that stores and adaptively fuses historical attention matrices within the encoder using a fixed weighting coefficient (α = 0.45), enabling progressive optimization of attention patterns across layers. Notably, this approach introduces historical attention fusion into the original architecture with minimal modifications and reveals that random initialization of the historical component yields better performance than zero initialization. Experimental results demonstrate consistent improvements, with accuracy gains of 1.33% on CIFAR-100 and 1.25% on TinyImageNet, and similar enhancements are observed across variants such as CaiT.

📝 Abstract
Vision Transformers have excelled in computer vision but their attention mechanisms operate independently across layers, limiting information flow and feature learning. We propose an effective cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach enables progressive refinement of attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Comprehensive experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT performance increasing from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing 1.01% enhancement. Systematic analysis identifies the blending hyperparameter of historical attention (alpha = 0.45) as optimal across all configurations, providing the ideal balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at https://github.com/banik-s/HAViT.
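The core mechanism described above is a convex blend of each layer's attention map with a running "historical" attention carried forward from earlier layers, with the history randomly initialized. The sketch below illustrates that idea in plain numpy; the function names and shapes are illustrative assumptions, not the authors' implementation (see the linked repository for that), and only α = 0.45 and the random initialization come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def blended_attention_stack(q_layers, k_layers, alpha=0.45, seed=0):
    """Sketch of cross-layer historical attention blending.

    At each layer the raw attention map is fused with the running
    historical attention:

        A_fused = alpha * A_hist + (1 - alpha) * A_curr

    The history starts from random values (row-normalized), which the
    paper reports outperforms zero initialization. Single-head,
    per-sample case for clarity.
    """
    rng = np.random.default_rng(seed)
    n = q_layers[0].shape[0]
    # Random, row-stochastic initialization of the historical attention.
    hist = softmax(rng.standard_normal((n, n)))
    fused_maps = []
    for q, k in zip(q_layers, k_layers):
        d = q.shape[-1]
        # Standard scaled dot-product attention for the current layer.
        curr = softmax(q @ k.T / np.sqrt(d))
        # Convex blend with history; rows stay normalized because a
        # convex combination of row-stochastic matrices is row-stochastic.
        fused = alpha * hist + (1 - alpha) * curr
        hist = fused  # carry the fused map forward to the next layer
        fused_maps.append(fused)
    return fused_maps
```

In a real ViT encoder, `fused` would multiply the value matrix at each layer; storing only the previous fused map keeps the memory overhead to a single attention matrix per head, which matches the paper's claim of minimal architectural change.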
Problem

Research questions and friction points this paper is trying to address.

Vision Transformers
attention mechanism
cross-layer information flow
feature learning
historical attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Historical Attention
Vision Transformer
Cross-layer Attention Propagation
Attention Blending
Feature Refinement
Swarnendu Banik
Computer Vision and Biometrics Lab, Indian Institute of Information Technology, Allahabad
Manish Das
Computer Vision and Biometrics Lab, Indian Institute of Information Technology, Allahabad
Shiv Ram Dubey
Associate Professor, Indian Institute of Information Technology (IIIT), Allahabad, Prayagraj, BHARAT
Computer Vision · Deep Learning · Biometrics · Medical Imaging
Satish Kumar Singh
Computer Vision and Biometrics Lab, Indian Institute of Information Technology, Allahabad