🤖 AI Summary
In Vision Transformers (ViTs), positional embeddings (PEs) have limited expressiveness because they are simply added element-wise to token embeddings. Layer-wise schemes that deliver PE to every layer, with independent Layer Normalization for token embeddings and PE, mitigate this, but a conflicting result arises when global average pooling (GAP) replaces the class token in such a structure. This work identifies a *counterbalancing* role of PEs during layer-wise propagation: PE counterbalances token embedding values at each layer, yet this effect is insufficient in the layer-wise structure. To address this, the authors propose MPVG, a PE-enhancement framework combining (i) layer-wise PE propagation, (ii) independent Layer Normalization per layer for PE and token embeddings, (iii) preservation of PE's counterbalancing directionality, and (iv) compatibility with GAP. Experiments on image classification, object detection, and semantic segmentation show consistent gains over existing methods across ViT variants, indicating that maintaining the counterbalancing directionality of PE significantly impacts overall performance.
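The layer-wise propagation with independent normalization, points (i)–(ii) above, can be sketched minimally. This is an illustrative assumption, not the paper's implementation: the `block` function stands in for a full transformer layer (attention + MLP), and the dimensions and normalization details are chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, L = 16, 32, 4            # tokens, embedding dim, layers

tokens = rng.normal(size=(N, D))
pe = rng.normal(size=(N, D))    # in practice a learned positional embedding

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (learned scale/shift omitted).
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def block(x, W):
    # Stand-in for a transformer block; a residual nonlinear mix.
    return x + np.tanh(x @ W)

Ws = [rng.normal(scale=0.02, size=(D, D)) for _ in range(L)]

# Standard ViT: PE is added once, at the input.
x = tokens + pe
for W in Ws:
    x = block(layer_norm(x), W)

# Layer-wise scheme: PE is re-injected at every layer and
# normalized independently of the token stream.
y = tokens
for W in Ws:
    y = block(layer_norm(y) + layer_norm(pe), W)

# GAP readout: average over tokens instead of a class token.
feat = y.mean(axis=0)
print(feat.shape)  # (32,)
```

The key structural difference is that `pe` remains available, with its own normalization statistics, at every depth, rather than being absorbed into the token stream after the first layer.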
📝 Abstract
In vision transformers, position embedding (PE) plays a crucial role in capturing the order of tokens. However, because PE is simply added to the token embedding at the input, its expressiveness is limited. A layer-wise method that delivers PE to each layer and applies independent Layer Normalization to the token embedding and the PE has been adopted to overcome this limitation. In this paper, we identify a conflicting result that occurs in a layer-wise structure when the global average pooling (GAP) method is used instead of the class token. To overcome this problem, we propose MPVG, which maximizes the effectiveness of PE in a layer-wise structure with GAP. Specifically, we identify that PE counterbalances the token embedding values at each layer in a layer-wise structure, and that this counterbalancing role of PE is insufficient in the layer-wise structure; MPVG addresses this by maximizing the effectiveness of PE. Through experiments, we demonstrate that PE performs a counterbalancing role and that maintaining this counterbalancing directionality significantly impacts vision transformers. As a result, MPVG outperforms existing methods across vision transformers on various tasks.
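The GAP-versus-class-token distinction the abstract builds on can be illustrated in a few lines. This is a hedged sketch of the two readout conventions, not the paper's code; shapes and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 16, 32                         # patch tokens, embedding dim
patch_tokens = rng.normal(size=(N, D))
cls_token = rng.normal(size=(1, D))   # learned in a real model

# Class-token readout: a class token is prepended, passes through the
# encoder with the patches, and its final state is the representation.
seq = np.concatenate([cls_token, patch_tokens], axis=0)  # (N + 1, D)
cls_repr = seq[0]

# GAP readout: no extra token; the representation is the mean
# over all patch tokens at the final layer.
gap_repr = patch_tokens.mean(axis=0)

assert cls_repr.shape == gap_repr.shape == (D,)
```

Because GAP averages every token, the final representation depends on all positions equally, which is why the interaction between GAP and layer-wise PE behaves differently from the class-token case.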