Data Factory with Minimal Human Effort Using VLMs

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional data augmentation struggles to manipulate high-level semantic attributes (e.g., material, texture), while existing diffusion-based approaches often incur prohibitive computational costs or yield low-fidelity generations. To address this, we propose a training-free synthetic data generation framework that integrates a pretrained ControlNet with a vision-language model (VLM) to enable multi-path prompt generation, automatic semantic mask construction, and high-quality image filtering. Our method efficiently produces diverse images with pixel-accurate annotations and no human labeling, significantly boosting downstream task performance. In one-shot semantic segmentation on PASCAL-5i and COCO-20i, it surpasses current state-of-the-art methods, demonstrating superior semantic fidelity, diversity, and practical utility.

📝 Abstract
Generating sufficient and diverse data through augmentation offers an efficient alternative to the time-consuming and labour-intensive process of collecting and annotating pixel-wise images. Traditional data augmentation techniques often struggle to manipulate high-level semantic attributes, such as materials and textures. In contrast, diffusion models offer a robust alternative by effectively leveraging text-to-image or image-to-image transformation. However, existing diffusion-based methods are either computationally expensive or compromise performance. To address this issue, we introduce a novel training-free pipeline that integrates a pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels. This approach eliminates the need for manual annotation and significantly improves downstream tasks. To improve fidelity and diversity, we add a Multi-way Prompt Generator, a Mask Generator, and a High-quality Image Selection module. Our results on PASCAL-5i and COCO-20i show promising performance, outperforming concurrent work on one-shot semantic segmentation.
Problem

Research questions and friction points this paper is trying to address.

Generating diverse synthetic images with pixel-level labels
Reducing computational costs of diffusion-based data generation
Eliminating manual annotation needs for semantic segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free pipeline with ControlNet and VLMs
Multi-way modules enhance fidelity and diversity
Generates synthetic images with pixel-level labels
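The pipeline described above can be sketched as a minimal, training-free loop: a multi-way prompt generator multiplies diversity across semantic attributes, a generator (standing in for the ControlNet + mask step) produces image/label pairs, and a quality filter keeps only well-aligned samples. All names here (`multi_way_prompts`, `generate_dataset`, the attribute banks, the `quality` score) are hypothetical stand-ins to illustrate the control flow, not the authors' code; the actual diffusion and VLM calls are stubbed.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical attribute banks for the multi-way prompt generator:
# each "way" varies one high-level semantic attribute.
MATERIALS = ["wooden", "metallic", "ceramic"]
TEXTURES = ["smooth", "rough", "weathered"]
SCENES = ["indoors", "on a street", "in a field"]

@dataclass
class Sample:
    prompt: str
    image: object   # generated image (stubbed here)
    mask: object    # pixel-level label derived from the ControlNet condition
    quality: float  # VLM/CLIP-style alignment score in [0, 1]

def multi_way_prompts(class_name: str) -> List[str]:
    """One prompt per semantic path; diversity comes from prompting, not training."""
    prompts = [f"a photo of a {m} {class_name}" for m in MATERIALS]
    prompts += [f"a photo of a {class_name} with a {t} surface" for t in TEXTURES]
    prompts += [f"a photo of a {class_name} {s}" for s in SCENES]
    return prompts

def generate_dataset(class_name: str,
                     generate: Callable[[str], Sample],
                     threshold: float = 0.8) -> List[Sample]:
    """Generate one sample per prompt, then keep only high-quality ones.

    `generate` wraps the ControlNet call; because the condition (e.g., a
    segmentation map) is an input, the pixel-level label is known for free.
    """
    samples = [generate(p) for p in multi_way_prompts(class_name)]
    return [s for s in samples if s.quality >= threshold]
```

Because the spatial condition fed to ControlNet doubles as the ground-truth mask, no manual annotation step appears anywhere in the loop; the threshold filter plays the role of the high-quality image selection module.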