🤖 AI Summary
Open-vocabulary semantic segmentation (OVSS) faces two challenges in zero-shot settings: imprecise pixel-level localization and low inference efficiency. Contrastive learning lacks spatial fidelity, while diffusion models suffer from iterative computational overhead. This paper proposes a training-free, single-step diffusion inversion framework that segments all classes simultaneously in a single pass. Its core contributions are: (1) a dual-prompt mechanism that decouples text–image alignment; (2) hierarchical attention refinement, which fuses scale-aligned self-attention and cross-attention maps to enhance fine-grained spatial representation; and (3) a test-time flipping strategy that improves spatial consistency. Evaluated on PASCAL VOC, PASCAL Context, and COCO Object, the method achieves an average mean Intersection-over-Union (mIoU) of 43.8%, setting a new state of the art among training-free approaches, while its inference speed significantly surpasses that of iterative diffusion-based methods.
📄 Abstract
Open-vocabulary semantic segmentation (OVSS) aims to segment objects from arbitrary text categories without requiring densely annotated datasets. Although contrastive-learning-based models enable zero-shot segmentation, they often lose fine spatial precision at the pixel level due to global representation bias. In contrast, diffusion-based models naturally encode fine-grained spatial features via attention mechanisms that capture both global context and local details. However, they often struggle to balance the number of iterations against segmentation quality. In this work, we propose FastSeg, a novel and efficient training-free framework that uses only a (1+1)-step reverse process of a pretrained diffusion model (e.g., Stable Diffusion). Moreover, instead of running multiple times for different classes, FastSeg performs segmentation for all classes at once. To further enhance segmentation quality, FastSeg introduces three key components: (i) a dual-prompt mechanism for discriminative, class-aware attention extraction, (ii) a Hierarchical Attention Refinement Method (HARD) that enhances fused cross-attention using scale-aligned self-attention maps, and (iii) a Test-Time Flipping (TTF) scheme designed to improve spatial consistency. Extensive experiments show that FastSeg achieves state-of-the-art training-free performance, obtaining 43.8% average mIoU across the PASCAL VOC, PASCAL Context, and COCO Object benchmarks while maintaining superior inference efficiency. Our results demonstrate that FastSeg provides a strong foundation for extensibility, bridging the gap between segmentation quality and inference efficiency.