🤖 AI Summary
Existing few-shot classification and segmentation (FS-CS) methods perform poorly on small objects, and the standard evaluation protocol discards costly pixel-level annotations, limiting practical applicability. To address these issues, we propose the Efficient Masked Attention Transformer (EMAT), which introduces: (1) a novel memory-efficient masked attention mechanism that improves small-object feature modeling; (2) a learnable downscaling strategy and parameter-efficiency enhancements that reduce model complexity; and (3) two more realistic, annotation-aware evaluation settings that make use of the available labels. EMAT performs classification and segmentation jointly within a unified Transformer architecture. On the PASCAL-5ⁱ and COCO-20ⁱ benchmarks, it outperforms all existing FS-CS methods while using at least four times fewer trainable parameters, with notable accuracy gains on small objects.
📝 Abstract
Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-5$^i$ and COCO-20$^i$ datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation settings that consider these annotations to better reflect practical scenarios.
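The masked attention idea at the core of EMAT can be illustrated with a minimal NumPy sketch. This is a generic illustration of attention restricted to foreground support tokens, not the paper's actual memory-efficient implementation; the function name, shapes, and masking-by-large-negative-score convention are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(q, k, v, fg_mask):
    """Attention in which queries attend only to foreground support tokens.

    q: (Nq, d) query features; k, v: (Nk, d) support features;
    fg_mask: (Nk,) boolean, True where the support pixel is foreground.
    (Illustrative only -- not EMAT's actual mechanism.)
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (Nq, Nk) similarities
    scores = np.where(fg_mask[None, :], scores, -1e9)   # suppress background
    return softmax(scores, axis=-1) @ v                 # (Nq, d) outputs
```

With only one foreground token in the mask, every query's output collapses to that token's value vector, since the background scores receive effectively zero attention weight.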