🤖 AI Summary
Semantic segmentation relies heavily on costly, labor-intensive pixel-level annotations; meanwhile, existing text-to-image synthesis methods struggle to reliably generate multi-instance images with corresponding precise segmentation masks. To address this, we propose the first synergistic framework integrating segmentation-aware diffusion models with text-driven image editing, enabling controllable generation of multi-object images and their pixel-aligned masks in open-world scenarios. Built upon the Stable Diffusion architecture, our method introduces a mask consistency constraint and employs joint optimization during training, ensuring high-fidelity, multi-instance synthesis with accurate mask alignment. Evaluated on the VOC 2012 zero-shot semantic segmentation benchmark, our approach achieves new state-of-the-art performance. Notably, a segmentation model trained solely on our synthetic data surpasses the performance of a model trained on real annotated data—demonstrating substantial alleviation of the annotation bottleneck.
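The summary mentions a mask consistency constraint optimized jointly with the generation objective. As a rough illustration only, the sketch below shows what such a combined objective could look like: a denoising (image) loss plus a weighted mask-consistency term. The specific losses, the `lam` weighting, and all function names here are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of joint optimization: a denoising loss plus a
# weighted mask-consistency term. All names and the weighting scheme
# are illustrative assumptions, not the paper's implementation.

def mse(pred, target):
    # Mean squared error over two equal-length sequences of floats.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def mask_consistency_loss(pred_mask, ref_mask):
    # Pixel-wise squared error between the mask predicted from the
    # generated image and the mask the generator was conditioned on.
    return mse(pred_mask, ref_mask)

def joint_loss(denoise_pred, noise_target, pred_mask, ref_mask, lam=0.5):
    # Total objective: standard denoising loss plus the mask-consistency
    # term, balanced by a hypothetical hyperparameter lam.
    return mse(denoise_pred, noise_target) + lam * mask_consistency_loss(
        pred_mask, ref_mask
    )
```

In this toy form, perfectly aligned masks contribute zero extra loss, so the constraint only penalizes generations whose predicted mask drifts from the conditioning mask.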
📝 Abstract
Current semantic segmentation models typically require large amounts of manually annotated data, a process that is both time-consuming and resource-intensive. Alternatively, leveraging advanced text-to-image models such as Midjourney and Stable Diffusion has emerged as an efficient strategy, enabling the automatic generation of synthetic data in place of manual annotation. However, previous methods have been limited to generating single-instance images, because generating multiple instances with Stable Diffusion has proven unstable. To address this limitation and expand the scope and diversity of synthetic datasets, we propose **Free-Mask**, a framework that combines a diffusion model for segmentation with advanced image-editing capabilities, allowing multiple objects to be integrated into images via text-to-image models. Our method facilitates the creation of highly realistic datasets that closely emulate open-world environments while generating accurate segmentation masks, reducing the labor of manual annotation while ensuring precise mask generation. Experimental results demonstrate that synthetic data generated by **Free-Mask** enables segmentation models to outperform those trained on real data, especially in zero-shot settings. Notably, **Free-Mask** achieves new state-of-the-art results on previously unseen classes in the VOC 2012 benchmark.