🤖 AI Summary
Existing open-vocabulary semantic segmentation methods struggle to accurately localize object regions given only class names (without full sentences), primarily due to insufficient cross-modal alignment between vision and language. This paper proposes a test-time, image-driven text prompt inversion mechanism that explicitly injects image structural information into the text embedding space, enabling the generation of class-aware and structurally consistent pixel-level masks. Key contributions include: (1) the first prompt-embedding-space inversion that models visual-text associations from diffusion-model attention maps; (2) a Contrastive Soft Clustering (CSC) loss that enforces pixel-level structural awareness via intra-class compactness and inter-class separation; and (3) a structure-guided mask optimization strategy. The method achieves state-of-the-art performance among open-source approaches on the PASCAL VOC, PASCAL Context, and COCO Object benchmarks.
📝 Abstract
Visual-textual correlations in the attention maps derived from text-to-image diffusion models have proven beneficial for dense visual prediction tasks, e.g., semantic segmentation. However, a significant challenge arises from the distributional discrepancy between the context-rich sentences used for image generation and the isolated class names typically used in semantic segmentation. This discrepancy hinders diffusion models from capturing accurate visual-textual correlations. To address this, we propose InvSeg, a test-time prompt inversion method that tackles open-vocabulary semantic segmentation by inverting image-specific visual context into the text prompt embedding space, leveraging structure information derived from the diffusion model's reconstruction process to enrich text prompts so as to associate each class with a structure-consistent mask. Specifically, we introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information, softly selecting anchors for each class and calculating weighted distances to pull intra-class pixels closer while separating inter-class pixels, thereby ensuring mask distinction and internal consistency. By incorporating sample-specific context, InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities. Experiments show that InvSeg achieves state-of-the-art performance on the PASCAL VOC, PASCAL Context and COCO Object datasets.