🤖 AI Summary
Open-vocabulary semantic segmentation (OVSS) faces two challenges in zero-shot settings: imprecise pixel-level localization and low inference efficiency. Contrastive learning lacks spatial fidelity, while diffusion models suffer from iterative computational overhead. This paper proposes a training-free, single-step diffusion inversion framework that segments all classes simultaneously in a single pass. Its core contributions are: (1) a dual-prompt mechanism that decouples text–image alignment; (2) hierarchical attention refinement, which fuses scale-aligned self-attention and cross-attention maps to enhance fine-grained spatial representation; and (3) a test-time flipping strategy that improves spatial consistency. Evaluated on PASCAL VOC, PASCAL Context, and COCO Object, the method achieves an average mean Intersection-over-Union (mIoU) of 43.8%, setting a new state of the art among training-free approaches, while its inference speed significantly surpasses that of iterative diffusion-based methods.
📄 Abstract
Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive-learning-based models enable zero-shot segmentation, they often lose fine spatial precision at the pixel level due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often struggle to balance the number of iterations against segmentation quality. In this work, we propose FastSeg, a novel and efficient training-free framework that uses only a (1+1)-step reverse process of a pretrained diffusion model (e.g., Stable Diffusion). Moreover, instead of running multiple times for different classes, FastSeg performs segmentation for all classes at once. To further enhance segmentation quality, FastSeg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances fused cross-attention using scale-aligned self-attention maps, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FastSeg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across the PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FastSeg provides a strong foundation for extensibility, bridging the gap between segmentation quality and inference efficiency.