AI Summary
This work addresses zero-shot, open-vocabulary semantic segmentation without manual annotations, fine-tuning, prompt engineering, or pre-trained segmentation networks. The proposed method leverages Stable Diffusion and introduces a novel joint modeling of cross-attention (enabling coarse-grained semantic localization) and multi-scale self-attention (facilitating fine-grained region propagation), thereby emulating classical seeded segmentation to achieve end-to-end mask generation from text-guided seeds to semantic expansion. Additionally, a background consistency optimization is incorporated to enhance boundary precision. The approach is plug-and-play and requires no task-specific adaptation. Evaluated on PASCAL VOC and COCO, it significantly outperforms existing generative segmentation methods in zero-shot settings. By unifying diffusion-based attention mechanisms with segmentation principles, this work establishes an efficient, interpretable, and annotation-free paradigm for open-vocabulary pixel-level segmentation.
Abstract
Entrusted with the goal of pixel-level object classification, semantic segmentation networks entail the laborious preparation of pixel-level annotation masks. To obtain pixel-level annotation masks for a given class without human effort, a few recent works have proposed to generate pairs of images and annotation masks by exploiting the image-text relationships modeled by text-to-image generative models, especially Stable Diffusion. However, these works do not fully exploit the capability of text-guided diffusion models and thus require a pre-trained segmentation network, careful text prompt tuning, or the training of a segmentation network to generate final annotation masks. In this work, we take a closer look at the attention mechanisms of Stable Diffusion, from which we draw connections with classical seeded segmentation approaches. In particular, we show that cross-attention alone provides only very coarse object localization, which can nevertheless supply initial seeds. Then, akin to region expansion in seeded segmentation, we exploit the semantic-correspondence-modeling capability of self-attention to iteratively spread attention from the seeds to the whole object using multi-scale self-attention maps. We also observe that an image synthesized from a simple text prompt often has a uniform background, in which correspondences are easier to find than within complex-structured objects. We therefore further refine each mask using a more accurate background mask. Our proposed method, dubbed SeeDiff, generates high-quality masks off-the-shelf from Stable Diffusion, without any additional training procedure, prompt tuning, or pre-trained segmentation network.
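The seed-and-expand idea described above can be sketched as follows. This is a minimal illustration with hypothetical, pre-extracted attention maps, not the paper's actual pipeline: `seeds_from_cross_attention` and `propagate` are illustrative names, and the thresholding and normalization choices are assumptions.

```python
import numpy as np

def seeds_from_cross_attention(cross_attn, top_frac=0.05):
    """Pick the top-activated pixels of a flattened cross-attention map
    for a class token as initial seeds (coarse localization)."""
    k = max(1, int(top_frac * cross_attn.size))
    thresh = np.sort(cross_attn.ravel())[-k]
    return (cross_attn >= thresh).astype(np.float32)

def propagate(seed_mask, self_attn, n_iter=10):
    """Iteratively spread seed activations through an (N, N) self-attention
    map, akin to region expansion in seeded segmentation."""
    m = seed_mask.ravel().astype(np.float32)
    for _ in range(n_iter):
        m = self_attn @ m          # one step of attention-guided expansion
        m = m / (m.max() + 1e-8)   # renormalize to [0, 1]
    return m
```

In the actual method the propagation would use self-attention maps aggregated across multiple resolutions, and the resulting foreground mask would be refined against a background mask, but the loop above captures the core seeded-expansion mechanism.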