🤖 AI Summary
To address the excessive computational overhead of Transformer encoders in general-purpose image segmentation, this paper proposes PRO-SCALE, a progressive token-length scaling strategy that dynamically and hierarchically reduces the input token sequence length in Mask2Former's encoder. PRO-SCALE integrates three components: (i) multi-scale feature-adaptive downsampling, (ii) inter-layer token-length scheduling, and (iii) query-aware token retention. This design achieves substantial efficiency gains without compromising accuracy: on COCO, PRO-SCALE reduces encoder GFLOPs by 52% and overall model GFLOPs by 27% with no drop in mAP. It also generalizes across both segmentation and detection tasks. By enabling scalable, architecture-agnostic token compression, PRO-SCALE offers a lightweight approach to efficient general-purpose segmentation.
📄 Abstract
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions. With efficiency being a high priority for scaling such models, we observe that the state-of-the-art method Mask2Former spends roughly 50% of its compute on the transformer encoder alone. This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer. With this observation, we propose a strategy termed PROgressive Token Length SCALing for Efficient transformer encoders (PRO-SCALE) that can be plugged into the Mask2Former segmentation architecture to significantly reduce the computational cost. The underlying principle of PRO-SCALE is to progressively scale the length of the tokens with the layers of the encoder. This allows PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance (~52% encoder and ~27% overall GFLOPs reduction with no drop in performance on the COCO dataset). Experiments conducted on public benchmarks demonstrate PRO-SCALE's flexibility across architectural configurations and its potential to extend beyond segmentation to object detection. Code: https://github.com/abhishekaich27/proscale-pytorch
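To make the underlying principle concrete, here is a toy sketch (not the paper's implementation) of why processing fewer tokens in early encoder layers saves compute. The token schedule, image/scale sizes, and the linear-cost model (as in deformable attention, where cost grows linearly with token count) are all illustrative assumptions; the actual scaling policy is described in the paper and repository.

```python
# Toy illustration of progressive token-length scaling: early encoder
# layers see only the coarsest feature scale, and finer-scale tokens are
# introduced in later layers. All numbers here are illustrative.

def layer_token_counts(scale_tokens, num_layers):
    """Return the number of active tokens at each encoder layer,
    introducing one feature scale per stage, coarsest first."""
    stages = len(scale_tokens)
    layers_per_stage = num_layers // stages
    counts = []
    for layer in range(num_layers):
        stage = min(layer // layers_per_stage, stages - 1)
        counts.append(sum(scale_tokens[: stage + 1]))
    return counts

# Token counts for three backbone scales (1/32, 1/16, 1/8 of a 640x640 image).
scales = [20 * 20, 40 * 40, 80 * 80]
num_layers = 6

progressive = layer_token_counts(scales, num_layers)
baseline = [sum(scales)] * num_layers  # full-length tokens at every layer

# With a per-layer cost linear in token count, the relative encoder cost
# is just the ratio of total tokens processed across all layers.
savings = 1 - sum(progressive) / sum(baseline)
print(f"per-layer tokens: {progressive}")
print(f"encoder cost reduction: {savings:.0%}")
```

Even this crude schedule cuts the modeled encoder cost by more than half, which is in the same ballpark as the ~52% encoder GFLOPs reduction reported above, though the two numbers are not directly comparable.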