🤖 AI Summary
Traditional data augmentation struggles to manipulate high-level semantic attributes (e.g., material, texture), while existing diffusion-based approaches often incur prohibitive computational costs or yield low-fidelity generations. To address this, we propose a training-free synthetic data generation framework that synergistically integrates a pretrained ControlNet with a vision-language model (VLM) to enable multi-way prompt generation, automatic semantic mask construction, and high-quality image filtering. Our method efficiently produces diverse images with pixel-accurate annotations and no human labeling, significantly boosting downstream task performance. In few-shot semantic segmentation on PASCAL-5i and COCO-20i, it surpasses current state-of-the-art methods, demonstrating superior semantic fidelity, diversity, and practical utility.
📝 Abstract
Generating sufficient and diverse data through augmentation offers an efficient alternative to the time-consuming and labour-intensive process of collecting and annotating images at the pixel level. Traditional data augmentation techniques often struggle to manipulate high-level semantic attributes, such as materials and textures. In contrast, diffusion models offer a robust alternative by effectively leveraging text-to-image or image-to-image transformation. However, existing diffusion-based methods are either computationally expensive or compromise on performance. To address this issue, we introduce a novel training-free pipeline that integrates a pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels. This approach eliminates the need for manual annotation and significantly improves downstream tasks. To improve fidelity and diversity, we add a Multi-way Prompt Generator, a Mask Generator, and a High-quality Image Selection module. Our results on PASCAL-5i and COCO-20i demonstrate promising performance, outperforming concurrent work on one-shot semantic segmentation.
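The pipeline described above can be sketched as three stages: prompt generation, mask-conditioned synthesis, and quality filtering. The following is a minimal, hypothetical Python sketch of that flow; every component function is a placeholder (a real system would call a VLM for prompts, ControlNet for synthesis, and a scoring model such as CLIP for filtering), and all names here are illustrative, not the paper's actual API.

```python
# Hypothetical sketch of the training-free pipeline. All functions below are
# placeholders standing in for the paper's components, not real implementations.

def multi_way_prompt_generator(class_name, n_prompts=3):
    # Placeholder for the Multi-way Prompt Generator: a VLM would produce
    # diverse prompts varying high-level attributes like material and texture.
    templates = [
        "a photo of a {} on a wooden table",
        "a {} made of brushed metal, studio lighting",
        "a {} with a rough woven texture, outdoors",
    ]
    return [t.format(class_name) for t in templates[:n_prompts]]

def generate_image_and_mask(prompt, condition_mask):
    # Placeholder for mask-conditioned ControlNet synthesis: the generated
    # image follows the layout of condition_mask, so the mask doubles as a
    # pixel-level label at no annotation cost.
    return {"prompt": prompt, "image": "<synthetic image>", "mask": condition_mask}

def quality_score(sample):
    # Placeholder for High-quality Image Selection, e.g. CLIP similarity
    # between the generated image and its prompt. Dummy score in [0, 1).
    return (len(sample["prompt"]) % 10) / 10.0

def synthesize_dataset(class_name, condition_mask, keep_top_k=2):
    prompts = multi_way_prompt_generator(class_name)
    samples = [generate_image_and_mask(p, condition_mask) for p in prompts]
    # Keep only the best-scoring generations to preserve fidelity.
    samples.sort(key=quality_score, reverse=True)
    return samples[:keep_top_k]

dataset = synthesize_dataset("chair", condition_mask="<binary mask>")
print(len(dataset))  # 2
```

Each returned sample carries both a synthetic image and its mask, which is what lets the generated data plug directly into segmentation training without manual labeling.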