🤖 AI Summary
Traditional data augmentation struggles to manipulate high-level semantic attributes (e.g., material, texture), while existing diffusion-based approaches often incur prohibitive computational costs or yield low-fidelity generations. To address this, we propose a training-free synthetic data generation framework that synergistically integrates a pretrained ControlNet with a vision-language model (VLM) to enable multi-way prompt generation, automatic semantic mask construction, and high-quality image filtering. Our method efficiently produces diverse images with pixel-accurate annotations and no human labeling, significantly boosting downstream task performance. In few-shot semantic segmentation on PASCAL-5i and COCO-20i, it surpasses current state-of-the-art methods, demonstrating superior semantic fidelity, diversity, and practical utility.
📝 Abstract
Generating sufficient and diverse data through augmentation offers an efficient alternative to the time-consuming and labour-intensive process of collecting and annotating images at the pixel level. Traditional data augmentation techniques often struggle to manipulate high-level semantic attributes, such as materials and textures. In contrast, diffusion models offer a robust alternative by effectively leveraging text-to-image or image-to-image transformation. However, existing diffusion-based methods are either computationally expensive or compromise on performance. To address this issue, we introduce a novel training-free pipeline that integrates a pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels. This approach eliminates the need for manual annotation and significantly improves downstream tasks. To improve fidelity and diversity, we add a Multi-way Prompt Generator, a Mask Generator, and a High-quality Image Selection module. Our results on PASCAL-5i and COCO-20i demonstrate promising performance, outperforming concurrent work on one-shot semantic segmentation.
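The pipeline described above can be sketched as three stages: prompt generation, mask-conditioned synthesis, and quality filtering. The following is a minimal, hypothetical Python sketch of that flow; every component function is a placeholder (a real system would call a VLM for prompts, ControlNet for synthesis, and a scoring model such as CLIP for filtering), and all names here are illustrative, not the paper's actual API.

```python
# Hypothetical sketch of the training-free pipeline. All functions below are
# placeholders standing in for the paper's components, not real implementations.

def multi_way_prompt_generator(class_name, n_prompts=3):
    # Placeholder for the Multi-way Prompt Generator: a VLM would produce
    # diverse prompts varying high-level attributes like material and texture.
    templates = [
        "a photo of a {} on a wooden table",
        "a {} made of brushed metal, studio lighting",
        "a {} with a rough woven texture, outdoors",
    ]
    return [t.format(class_name) for t in templates[:n_prompts]]

def generate_image_and_mask(prompt, condition_mask):
    # Placeholder for mask-conditioned ControlNet synthesis: the generated
    # image follows the layout of condition_mask, so the mask doubles as a
    # pixel-level label at no annotation cost.
    return {"prompt": prompt, "image": "<synthetic image>", "mask": condition_mask}

def quality_score(sample):
    # Placeholder for High-quality Image Selection, e.g. CLIP similarity
    # between the generated image and its prompt. Dummy score in [0, 1).
    return (len(sample["prompt"]) % 10) / 10.0

def synthesize_dataset(class_name, condition_mask, keep_top_k=2):
    prompts = multi_way_prompt_generator(class_name)
    samples = [generate_image_and_mask(p, condition_mask) for p in prompts]
    # Keep only the best-scoring generations to preserve fidelity.
    samples.sort(key=quality_score, reverse=True)
    return samples[:keep_top_k]

dataset = synthesize_dataset("chair", condition_mask="<binary mask>")
print(len(dataset))  # 2
```

Each returned sample carries both a synthetic image and its mask, which is what lets the generated data plug directly into segmentation training without manual labeling.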