FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

📅 2024-03-29
🏛️ IEEE International Joint Conference on Neural Networks (IJCNN)
📈 Citations: 10
Influential: 1
🤖 AI Summary
This work addresses two limitations of conventional image segmentation methods, their reliance on large-scale annotated datasets and task-specific fine-tuning, by proposing a zero-shot, training-free open-vocabulary segmentation paradigm. Methodologically, it combines BLIP-2 for image captioning, Stable Diffusion for extracting discriminative visual features, and CLIP for cross-modal alignment; pixel-level masks are produced by clustering and binarizing the diffusion features and then refined. Key contributions include: (i) empirical evidence that diffusion-based visual representations outperform those of classical pre-trained models on segmentation tasks; and (ii) a fully training-free open-vocabulary segmentation framework. Experiments show that the method surpasses many training-based approaches on Pascal VOC and COCO and is competitive with recent weakly-supervised segmentation methods under open-vocabulary settings.

📝 Abstract
Foundation models have exhibited unprecedented capabilities across various domains and tasks. Models like CLIP bridge cross-modal representations, while text-to-image diffusion models excel in realistic image generation. While the complexity of these models makes retraining infeasible, their superior performance has driven research to explore how to efficiently use them for downstream tasks. Our work explores how to leverage these models for dense visual prediction tasks, specifically image segmentation. To avoid the annotation cost or training large diffusion models, we constrain our method to be zero-shot and training-free. Our pipeline, dubbed FreeSeg-Diff, uses open-source foundation models to perform open-vocabulary segmentation as follows: (a) retrieving image caption (via BLIP-2) and visual features (via Stable Diffusion), (b) clustering and binarizing features to form class-agnostic object masks, (c) mapping these masks to textual classes using CLIP with open vocabulary support, and (d) refining coarse masks. FreeSeg-Diff surpasses many training-based methods on Pascal VOC and COCO datasets and delivers competitive results against recent weakly-supervised segmentation approaches. We provide experiments demonstrating the superiority of diffusion model features over other pre-trained models. Project page: https://bcorrad.github.io/freesegdiff/.
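Step (b) of the pipeline, clustering visual features and binarizing them into class-agnostic masks, can be sketched as follows. This is a minimal illustration on a toy feature map with a plain k-means; in FreeSeg-Diff the per-pixel features would come from Stable Diffusion's UNet, and the function names here are hypothetical, not the authors' implementation.

```python
import numpy as np

def kmeans(features, k, iters=10):
    """Plain k-means over per-pixel feature vectors (N, D) -> labels (N,)."""
    # deterministic init: use the first k distinct feature vectors as centers
    centers = np.unique(features, axis=0)[:k].astype(float).copy()
    for _ in range(iters):
        # assign each pixel to its nearest center (squared Euclidean distance)
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute centers; keep the old center if a cluster went empty
        for c in range(k):
            if (labels == c).any():
                centers[c] = features[labels == c].mean(0)
    return labels

def features_to_masks(feat_map, k):
    """Cluster a (H, W, D) feature map, binarize into k masks of shape (k, H, W)."""
    h, w, d = feat_map.shape
    labels = kmeans(feat_map.reshape(-1, d), k).reshape(h, w)
    return np.stack([labels == c for c in range(k)])

# toy "feature map": two clearly separated regions of an 8x8 image
feat = np.zeros((8, 8, 4))
feat[:, 4:] = 1.0
masks = features_to_masks(feat, k=2)  # two binary masks, one per region
```

Each resulting mask is class-agnostic at this point; naming the regions is deferred to the CLIP matching step, and the coarse masks are refined afterwards.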
Problem

Research questions and friction points this paper is trying to address.

Performing open-vocabulary image segmentation without any training
Leveraging the spatial representations of diffusion models for dense visual prediction
Avoiding pixel-level annotations in zero-shot segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages diffusion models for visual representations
Combines captioner and CLIP for open-vocabulary mapping
Uses clustering and refinement for segmentation masks
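The captioner + CLIP mapping amounts to a nearest-neighbor match in CLIP's joint embedding space: each class-agnostic mask is assigned the caption-derived class whose text embedding is most similar to the masked region's image embedding. A minimal sketch with stand-in embeddings (in the actual pipeline, `region_embs` would come from CLIP's image encoder applied to masked crops and `text_embs` from its text encoder; the small vectors below are illustrative only):

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between (M, D) and (N, D) -> (M, N)."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def assign_labels(region_embs, text_embs, class_names):
    """Match each masked-region embedding to the closest class embedding."""
    sim = cosine(region_embs, text_embs)      # (num_masks, num_classes)
    return [class_names[i] for i in sim.argmax(1)]

# stand-in 2-D embeddings for two classes and two masked regions
classes = ["dog", "car"]
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
region_embs = np.array([[0.9, 0.1], [0.2, 0.8]])
labels = assign_labels(region_embs, text_embs, classes)
# → ["dog", "car"]
```

Because the class list comes from the BLIP-2 caption rather than a fixed label set, the matching naturally supports an open vocabulary.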