Memory Efficient Transformer Adapter for Dense Predictions

📅 2025-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing ViT adapters suffer from suboptimal inference speed due to inefficient memory access patterns, such as standard layer normalization and frequent tensor reshaping. To address these memory-efficiency bottlenecks in dense prediction tasks, this paper proposes META, a lightweight adapter featuring: (1) layer normalization shared between the self-attention and feed-forward sub-layers, reducing redundant computation and memory traffic; (2) cross-shaped self-attention, which balances global contextual modeling with local receptive fields while avoiding frequent reshaping; and (3) a lightweight convolutional branch with cascaded computation of diverse head features, combining local inductive bias with multi-granularity representations. Theoretical analysis demonstrates superior generalization and adaptability. Extensive experiments on object detection, instance segmentation, and semantic segmentation show that META significantly improves the accuracy–efficiency trade-off, reducing memory access overhead by 32–47% and accelerating inference by 1.8–2.3× over state-of-the-art adapters.
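To make the shared-normalization idea concrete, here is a minimal sketch (hypothetical, not the authors' code): instead of a standard pre-norm block that runs a separate LayerNorm before the attention and feed-forward sub-layers, one normalization pass is computed and its output reused by both branches, removing one normalization's worth of memory reads and writes per block. NumPy stands in for a tensor library; `attn` and `ffn` are placeholder sub-layers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the channel dimension (last axis).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adapter_block_shared_ln(x, attn, ffn):
    """One adapter block where attention and FFN share a single LayerNorm.

    A standard pre-norm block computes x + attn(LN1(x)), then adds
    ffn(LN2(...)). The shared variant normalizes once and feeds both
    sub-layers, cutting normalization-related memory traffic roughly in half.
    """
    normed = layer_norm(x)    # computed once ...
    x = x + attn(normed)      # ... consumed by attention
    x = x + ffn(normed)       # ... and reused by the FFN
    return x

# Toy sub-layers standing in for real attention / feed-forward layers.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16))            # (sequence, channels)
out = adapter_block_shared_ln(tokens,
                              attn=lambda h: 0.1 * h,
                              ffn=lambda h: 0.1 * np.tanh(h))
print(out.shape)  # (4, 16)
```

This is only a shape-level illustration of the memory-access argument; the paper's actual block additionally includes the convolutional branch and cascaded heads described above.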

📝 Abstract
While current Vision Transformer (ViT) adapter methods have shown promising accuracy, their inference speed is implicitly hindered by inefficient memory access operations, e.g., standard normalization and frequent reshaping. In this work, we propose META, a simple and fast ViT adapter that can improve the model's memory efficiency and decrease memory time consumption by reducing the inefficient memory access operations. Our method features a memory-efficient adapter block that enables the common sharing of layer normalization between the self-attention and feed-forward network layers, thereby reducing the model's reliance on normalization operations. Within the proposed block, the cross-shaped self-attention is employed to reduce the model's frequent reshaping operations. Moreover, we augment the adapter block with a lightweight convolutional branch that can enhance local inductive biases, particularly beneficial for the dense prediction tasks, e.g., object detection, instance segmentation, and semantic segmentation. The adapter block is finally formulated in a cascaded manner to compute diverse head features, thereby enriching the variety of feature representations. Empirically, extensive evaluations on multiple representative datasets validate that META substantially enhances the predicted quality, while achieving a new state-of-the-art accuracy-efficiency trade-off. Theoretically, we demonstrate that META exhibits superior generalization capability and stronger adaptability.
Problem

Research questions and friction points this paper is trying to address.

Improves memory efficiency in Vision Transformers
Reduces inefficient memory access operations
Enhances accuracy-efficiency trade-off in dense predictions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Memory-efficient adapter block
Cross-shaped self-attention
Lightweight convolutional branch
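A rough sketch of the cross-shaped self-attention contribution (an illustrative reconstruction under stated assumptions, not the paper's implementation): half of the channels attend within horizontal stripes (rows) and the other half within vertical stripes (columns), so each position's receptive field is the cross through it, without the repeated window-partition reshapes that the paper identifies as a memory bottleneck.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def row_attention(x):
    """Self-attention restricted to rows. x: (H, W, C) feature map."""
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        t = x[i]                                    # (W, C) tokens in one row
        scores = t @ t.T / np.sqrt(t.shape[-1])     # scaled dot-product
        out[i] = softmax(scores) @ t
    return out

def cross_shaped_attention(x):
    """Split channels: half attend horizontally, half vertically, then concat."""
    c = x.shape[-1] // 2
    horiz = row_attention(x[..., :c])                     # horizontal stripes
    vert = row_attention(x[..., c:].transpose(1, 0, 2))   # vertical stripes
    return np.concatenate([horiz, vert.transpose(1, 0, 2)], axis=-1)

feat = np.random.default_rng(1).standard_normal((8, 8, 32))
result = cross_shaped_attention(feat)
print(result.shape)  # (8, 8, 32)
```

The sketch omits projections, multi-head splitting within each stripe group, and stripe widths greater than one, all of which a full implementation would include.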
Dong Zhang
The Hong Kong University of Science and Technology, InnoHK AI Chip Center for Smart Emerging Systems
Rui Yan
Nanjing University
Pingcheng Dong
The Hong Kong University of Science and Technology
AI Chip · Model Compression · HW/SW Co-Design
Kwang-Ting Cheng
The Hong Kong University of Science and Technology