JoDiffusion: Jointly Diffusing Image with Pixel-Level Annotations for Semantic Segmentation Promotion

📅 2025-12-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Pixel-level annotation is prohibitively expensive, and existing synthetic data approaches often suffer from image–mask misalignment and poor scalability. To address these challenges, we propose JoDiffusion—the first text-driven framework for joint generation of images and pixel-accurate semantic masks. Our method introduces three key innovations: (1) a joint image–mask latent diffusion model enabling end-to-end semantic alignment; (2) an annotation-specific VAE that maps semantic masks into a shared latent space with images; and (3) a mask-aware noise suppression strategy to enhance structural fidelity of generated masks. Evaluated on Pascal VOC, COCO, and ADE20K, models trained solely on JoDiffusion-synthesized data significantly outperform prior synthetic-data methods in downstream semantic segmentation tasks. Results demonstrate both high-fidelity mask generation and strong scalability—establishing JoDiffusion as a robust, general-purpose solution for scalable, high-quality semantic annotation synthesis.

📝 Abstract
Given the inherently costly and time-intensive nature of pixel-level annotation, the generation of synthetic datasets comprising sufficiently diverse synthetic images paired with ground-truth pixel-level annotations has recently garnered increasing attention for training high-performance semantic segmentation models. However, existing methods must either predict pseudo-annotations after image generation or generate images conditioned on manual annotation masks, which incurs image-annotation semantic inconsistency or scalability problems, respectively. To mitigate both problems at once, we present a novel dataset-generative diffusion framework for semantic segmentation, termed JoDiffusion. First, given a standard latent diffusion model, JoDiffusion incorporates an independent annotation variational auto-encoder (VAE) network to map annotation masks into the latent space shared by images. Then, the diffusion model is tailored to capture the joint distribution of each image and its annotation mask conditioned on a text prompt. By doing so, JoDiffusion can simultaneously generate paired images and semantically consistent annotation masks conditioned solely on text prompts, thereby demonstrating superior scalability. Additionally, a mask optimization strategy is developed to mitigate the annotation noise produced during generation. Experiments on the Pascal VOC, COCO, and ADE20K datasets show that the annotated datasets generated by JoDiffusion yield substantial performance improvements in semantic segmentation compared to existing methods.
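The joint-latent idea in the abstract can be sketched as follows. This is a minimal, hypothetical illustration using numpy: the latent shapes, channel counts, and noise schedule are assumptions for the sketch, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption for the sketch: the annotation VAE maps the mask into the
# same latent grid as the image (here, 4 channels on an 8x8 grid).
z_image = rng.standard_normal((4, 8, 8))
z_mask = rng.standard_normal((4, 8, 8))

# Joint latent: channel-wise concatenation, so a single (text-conditioned)
# denoiser sees image and mask together at every step, which is what
# keeps the generated pair semantically aligned.
z_joint = np.concatenate([z_image, z_mask], axis=0)  # shape (8, 8, 8)

# Standard DDPM-style forward noising applied to the *joint* latent:
# z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps
alpha_bar_t = 0.5
eps = rng.standard_normal(z_joint.shape)
z_t = np.sqrt(alpha_bar_t) * z_joint + np.sqrt(1.0 - alpha_bar_t) * eps

print(z_joint.shape, z_t.shape)
```

Because both modalities share one latent and one noise process, reverse diffusion denoises them jointly; splitting `z_t` back into its image and mask halves after sampling would then yield a paired sample.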
Problem

Research questions and friction points this paper is trying to address.

Generates synthetic images with consistent pixel-level annotations
Addresses scalability and semantic inconsistency in dataset creation
Enhances semantic segmentation model performance via optimized synthetic data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint diffusion of images and pixel-level annotation masks
Independent annotation VAE network for mask encoding
Mask optimization strategy to reduce annotation noise
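The mask-optimization bullet above can be illustrated with a simple clean-up pass. The paper's actual strategy is not detailed on this page, so the following is a hedged numpy sketch of one common approach: take per-pixel argmax labels, then suppress isolated noisy labels with a local majority vote.

```python
import numpy as np

def clean_mask(logits, k=3):
    """Hypothetical mask clean-up (not the paper's exact method):
    per-pixel argmax over class scores, followed by a k x k
    neighborhood majority vote to remove isolated noisy labels."""
    labels = logits.argmax(axis=0)  # (H, W) hard labels
    h, w = labels.shape
    pad = k // 2
    padded = np.pad(labels, pad, mode="edge")
    out = np.empty_like(labels)
    for i in range(h):
        for j in range(w):
            window = padded[i:i + k, j:j + k].ravel()
            vals, counts = np.unique(window, return_counts=True)
            out[i, j] = vals[counts.argmax()]  # most frequent label wins
    return out

# Toy example: 2-class scores that are class 0 everywhere except one
# isolated, spuriously confident class-1 pixel.
logits = np.zeros((2, 5, 5))
logits[0] = 1.0
logits[1, 2, 2] = 5.0

cleaned = clean_mask(logits)  # the isolated label is voted away
print(cleaned)
```

Without the vote, the raw argmax keeps the noisy pixel; with it, the surrounding majority overrides the isolated label, which is the kind of structural-fidelity fix the summary attributes to the mask-aware noise suppression.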
Haoyu Wang
School of Computer Science, Northwestern Polytechnical University
Lei Zhang
School of Computer Science, Northwestern Polytechnical University
Wenrui Liu
Zhejiang University
time series, multi-modal, LLM
Dengyang Jiang
Northwestern Polytechnical University
Computer Vision, Deep Learning, Machine Learning
Wei Wei
School of Computer Science, Northwestern Polytechnical University
Chen Ding
School of Computer Science & Technology, Xi’an University of Posts & Telecommunications