Beyond the final layer: Attentive multilayer fusion for vision transformers

📅 2026-01-14
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the limitation of conventional linear probing approaches, which typically rely solely on the final-layer representations of Vision Transformers (ViTs) and neglect task-relevant cues embedded in intermediate layers, thereby constraining transfer performance under a frozen backbone. To overcome this, the authors propose an attention-driven multi-layer fusion mechanism that, for the first time, dynamically integrates features from all ViT layers using learnable attention weights while keeping the backbone frozen. This approach adaptively combines low-level structural details with high-level semantic abstractions. Extensive experiments across 20 diverse downstream datasets and multiple pretrained models demonstrate significant improvements over standard linear probing, with particularly pronounced gains in tasks exhibiting substantial domain shifts, thereby validating both the critical role of intermediate-layer information and the effectiveness of the proposed fusion strategy.

๐Ÿ“ Abstract
With the rise of large-scale foundation models, efficiently adapting them to downstream tasks remains a central challenge. Linear probing, which freezes the backbone and trains a lightweight head, is computationally efficient but typically restricted to last-layer representations. We show that task-relevant information is distributed across the network hierarchy rather than encoded solely in the final layer. To leverage this distribution of information, we apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. This mechanism learns to identify the layers most relevant to a target task and combines low-level structural cues with high-level semantic abstractions. Across 20 diverse datasets and multiple pretrained foundation models, our method achieves consistent, substantial gains over standard linear probes. Attention heatmaps further reveal that tasks far from the pre-training domain benefit most from intermediate representations. Overall, our findings underscore the value of intermediate-layer information and demonstrate a principled, task-aware approach to unlocking its potential in probing-based adaptation.
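The core mechanism the abstract describes — fusing frozen per-layer features with learned attention weights before a lightweight head — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes one pooled feature vector (e.g. the CLS token) per ViT layer, and the function names, layer count, and feature dimension are placeholders.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_layers(layer_feats, layer_scores):
    """Attention-weighted fusion of per-layer features.

    layer_feats:  (L, D) array, one frozen feature vector per ViT layer
    layer_scores: (L,) learnable logits, one per layer (trained with the head)
    returns:      (D,) fused representation fed to the probing head
    """
    weights = softmax(layer_scores)      # normalized layer-attention weights
    return weights @ layer_feats         # convex combination of layer features

# Illustrative shapes: 12 layers x 768 dims, as in a ViT-Base backbone
rng = np.random.default_rng(0)
feats = rng.standard_normal((12, 768))
scores = np.zeros(12)                    # uniform attention at initialization
fused = fuse_layers(feats, scores)
```

With zero-initialized scores the fusion starts as a plain average of all layers; training the scores jointly with the linear head lets the probe shift weight toward whichever layers carry the most task-relevant signal, which the paper reports is often an intermediate layer under domain shift.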
Problem

Research questions and friction points this paper is trying to address.

vision transformers
linear probing
intermediate representations
layer fusion
downstream adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

attentive fusion
multilayer probing
Vision Transformer
intermediate representations
task-aware adaptation