🤖 AI Summary
Existing few-shot classification and segmentation (FS-CS) methods perform poorly on small objects, and the standard evaluation protocol discards costly pixel-level annotations, limiting practical applicability. To address these issues, we propose the Efficient Masked Attention Transformer (EMAT), which introduces: (1) a novel memory-efficient masked attention mechanism that improves small-object feature modeling; (2) a learnable downscaling strategy and parameter-efficiency enhancements that reduce model complexity; and (3) two more realistic, annotation-aware evaluation settings that make use of the available labels. EMAT performs classification and segmentation jointly within a unified Transformer architecture. On the PASCAL-5ⁱ and COCO-20ⁱ benchmarks, it outperforms all existing FS-CS methods while using at least four times fewer trainable parameters, with notable accuracy gains on small objects.
📝 Abstract
Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-5$^i$ and COCO-20$^i$ datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation settings that consider these annotations to better reflect practical scenarios.
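The masked attention idea at the core of EMAT can be illustrated with a minimal NumPy sketch. This is a generic illustration of attention restricted to foreground support tokens, not the paper's actual memory-efficient implementation; the function name, shapes, and masking-by-large-negative-score convention are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, fg_mask):
    """Attention in which queries attend only to foreground support tokens.

    q: (Nq, d) query features; k, v: (Nk, d) support features;
    fg_mask: (Nk,) boolean, True where the support pixel is foreground.
    (Illustrative only -- not EMAT's actual mechanism.)
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (Nq, Nk) similarities
    scores = np.where(fg_mask[None, :], scores, -1e9)   # suppress background
    return softmax(scores, axis=-1) @ v                 # (Nq, d) outputs
```

With only one foreground token in the mask, every query's output collapses to that token's value vector, since the background scores receive effectively zero attention weight.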