Free-Mask: A Novel Paradigm of Integration Between the Segmentation Diffusion Model and Image Editing

📅 2024-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Semantic segmentation relies heavily on costly, labor-intensive pixel-level annotations; meanwhile, existing text-to-image synthesis methods struggle to reliably generate multi-instance images with corresponding precise segmentation masks. To address this, we propose the first synergistic framework integrating segmentation-aware diffusion models with text-driven image editing, enabling controllable generation of multi-object images and their pixel-aligned masks in open-world scenarios. Built upon the Stable Diffusion architecture, our method introduces a mask consistency constraint and employs joint optimization during training, ensuring high-fidelity, multi-instance synthesis with accurate mask alignment. Evaluated on the VOC 2012 zero-shot semantic segmentation benchmark, our approach achieves new state-of-the-art performance. Notably, a segmentation model trained solely on our synthetic data surpasses the performance of a model trained on real annotated data—demonstrating substantial alleviation of the annotation bottleneck.

📝 Abstract
Current semantic segmentation models typically require a substantial amount of manually annotated data, a process that is both time-consuming and resource-intensive. Alternatively, leveraging advanced text-to-image models such as Midjourney and Stable Diffusion has emerged as an efficient strategy, enabling the automatic generation of synthetic data in place of manual annotation. However, previous methods have been limited to generating single-instance images, as generating multiple instances with Stable Diffusion has proven unstable. To address this limitation and expand the scope and diversity of synthetic datasets, we propose **Free-Mask**, a framework that combines a diffusion model for segmentation with advanced image editing capabilities, allowing multiple objects to be integrated into images via text-to-image models. Our method facilitates the creation of highly realistic datasets that closely emulate open-world environments while generating accurate segmentation masks, reducing the labor associated with manual annotation while ensuring precise mask generation. Experimental results demonstrate that synthetic data generated by **Free-Mask** enables segmentation models to outperform those trained on real data, especially in zero-shot settings. Notably, **Free-Mask** achieves new state-of-the-art results on previously unseen classes in the VOC 2012 benchmark.
Problem

Research questions and friction points this paper is trying to address.

Reducing manual annotation for semantic segmentation models
Generating stable multi-instance synthetic data via diffusion
Enhancing segmentation accuracy in zero-shot settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines a segmentation diffusion model with text-driven image editing
Generates multi-object scenes via text-to-image synthesis
Produces pixel-accurate masks alongside the synthetic images
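The page does not include any pseudocode for the framework; as a rough illustration of the "pixel-aligned mask" idea behind the last bullet, the sketch below (function names and shapes are hypothetical, not from the paper) composites a cut-out object onto a background image while updating the segmentation label map in lockstep, so the pasted pixels and their class labels stay aligned by construction.

```python
import numpy as np

def composite_object(background, label_map, obj_rgba, class_id, top, left):
    """Paste a cut-out object (RGBA; alpha acts as its binary mask) onto a
    background image and write its class id into the label map at exactly
    the pasted pixels, keeping image and mask pixel-aligned."""
    h, w = obj_rgba.shape[:2]
    alpha = obj_rgba[..., 3:4].astype(np.float32) / 255.0  # (h, w, 1)
    region = background[top:top + h, left:left + w]
    # Alpha-blend the object RGB over the background crop.
    background[top:top + h, left:left + w] = (
        alpha * obj_rgba[..., :3] + (1.0 - alpha) * region
    ).astype(background.dtype)
    # Label every sufficiently opaque object pixel with its class id.
    label_map[top:top + h, left:left + w][alpha[..., 0] > 0.5] = class_id
    return background, label_map
```

A full pipeline in the paper's spirit would generate each object with a text-to-image model and obtain its mask from the segmentation diffusion model; the compositing step above is only the final, model-free bookkeeping that guarantees the synthetic image and its annotation never drift apart.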