ENACT: Entropy-based Clustering of Attention Input for Reducing the Computational Needs of Object Detection Transformers

📅 2024-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision Transformers (ViTs) suffer from quadratic computational and memory complexity in self-attention with respect to input sequence length, hindering their efficiency in object detection. To address this, we propose ENACT—a plug-and-play, entropy-driven attention compression module that differentiably clusters attention inputs based on pixel-level feature entropy similarity, reducing sequence length without modifying the backbone. ENACT comprises three core components: entropy-aware feature clustering, a multi-head attention adaptation mechanism, and a lightweight learnable clusterer. Evaluated on COCO, ENACT significantly reduces GPU memory consumption and training cost across three state-of-the-art detection Transformers—DETR, Deformable DETR, and DINO—while incurring only a marginal mAP degradation of 0.5–1.2 points. This demonstrates an effective trade-off between computational efficiency and detection accuracy.

📝 Abstract
Transformers demonstrate competitive precision on vision-based object detection, but they require considerable computational resources because the attention weights grow quadratically with the input size. In this work, we propose to cluster the transformer input on the basis of its entropy, since pixels belonging to the same object tend to have similar entropy. This is expected to reduce GPU usage during training while maintaining reasonable accuracy. The idea is realized in a module called ENtropy-based Attention Clustering for detection Transformers (ENACT), which serves as a plug-in to any transformer network based on multi-head self-attention. Experiments on the COCO object detection dataset with three detection transformers show that memory requirements are reduced while detection accuracy degrades only slightly. The code of the ENACT module is available at https://github.com/GSavathrakis/ENACT.
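The core idea — shortening the attention input by pooling tokens with similar entropy — can be sketched as follows. This is a simplified illustration, not the paper's actual ENACT algorithm: the `token_entropy` and `entropy_cluster` functions, the softmax-based entropy estimate, and the equal-width binning are all assumptions made for the sketch; the paper uses a learnable, differentiable clusterer.

```python
import numpy as np

def token_entropy(x, eps=1e-12):
    # Shannon entropy of each token's softmax-normalized feature vector.
    # x: (seq_len, dim) -> returns (seq_len,)
    p = np.exp(x - x.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + eps)).sum(axis=-1)

def entropy_cluster(x, n_clusters):
    # Pool tokens whose entropies fall into the same of `n_clusters`
    # equal-width bins, shortening the sequence fed to self-attention.
    h = token_entropy(x)
    edges = np.linspace(h.min(), h.max(), n_clusters + 1)
    bins = np.clip(np.digitize(h, edges[1:-1]), 0, n_clusters - 1)
    pooled = [x[bins == b].mean(axis=0)
              for b in range(n_clusters) if np.any(bins == b)]
    return np.stack(pooled)  # (<= n_clusters, dim)
```

Because attention cost is quadratic in sequence length, reducing, say, 100 tokens to at most 8 clusters before the attention layer cuts the attention-weight matrix from 100x100 to at most 8x8, which is the source of the memory savings the paper reports.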
Problem

Research questions and friction points this paper is trying to address.

Reducing computational resources for object detection transformers
Clustering transformer input based on entropy similarity
Maintaining accuracy while decreasing GPU memory usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clusters transformer input using entropy
Reduces GPU usage during training
Maintains accuracy with less memory