Representation-Conditioned Diffusion Models for Guided Training Data Generation

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of high-quality labeled data and the high cost of annotation in deep learning by proposing a representation-conditioned latent diffusion model for controllable image synthesis, conditioned on pretrained visual representations from DINOv2, DINOv3, and CLIP. Integrating these representations into the diffusion process improves both the quality and the mode coverage of generated samples, and the conditioning space can also be used for sample filtering and augmentation. On ImageNet100, classifiers trained on synthetic data generated by this approach achieve 10.76 percentage points higher top-1 accuracy than those trained on class-conditional generations, and, when the synthetic dataset is scaled up, surpass a classifier trained on the real data by 2.0 percentage points.

📝 Abstract

Data availability remains a critical bottleneck in many deep learning applications. Large-scale datasets are often expensive to collect, curate and annotate, which can limit the scalability and applicability of supervised learning methods. In this work, we evaluate the classification performance of models trained on synthetic image datasets produced by generative deep learning. In particular, we use latent diffusion models conditioned on learned representations from DINOv2, DINOv3, and CLIP. Our results demonstrate that this representation-conditioned formulation outperforms class-conditioned generation by a large margin (+10.76 p.p. top-1 accuracy on ImageNet100) by improving sample quality and mode coverage. Furthermore, by scaling the size of the synthetic dataset, we are able to outperform a classifier trained on the real data (+2.0 p.p. top-1 accuracy). We also demonstrate how generated images can be used for augmentation purposes, outperforming classical augmentation methods, and how the conditioning space can be used for sample filtering to further improve training value. Collectively, these findings highlight that representation-conditioned diffusion models provide a promising approach for augmenting, complementing, or potentially replacing real-world datasets in large-scale visual learning tasks.
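The abstract mentions filtering samples in the conditioning space but does not state the criterion. The sketch below shows one plausible way such filtering could work, under assumptions that are not taken from the paper: each generated image has a representation vector (for example from DINOv2 or CLIP), and a sample is kept when its vector is among the most cosine-similar to the mean representation of its class. The function name `filter_by_centroid_similarity` and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np


def filter_by_centroid_similarity(embeddings, labels, keep_fraction=0.8):
    """Return a boolean mask over samples.

    Hypothetical criterion (an assumption, not the paper's method): within
    each class, keep the `keep_fraction` of samples whose representation is
    most cosine-similar to the class mean representation.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)

    # L2-normalise so that dot products are cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    keep = np.zeros(len(labels), dtype=bool)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        # Class centroid in the normalised representation space.
        centroid = unit[idx].mean(axis=0)
        centroid = centroid / max(np.linalg.norm(centroid), 1e-12)
        sims = unit[idx] @ centroid
        # Keep at least one sample per class.
        n_keep = max(1, int(round(keep_fraction * len(idx))))
        keep[idx[np.argsort(-sims)[:n_keep]]] = True
    return keep
```

Calling the function with the representations and class labels of a generated dataset returns a mask that can index the image list before classifier training. Other criteria, such as a fixed similarity threshold or distance to real-image representations, would fit the same interface.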

Problem

Research questions and friction points this paper is trying to address.

data availability

training data generation

supervised learning

dataset scalability

data bottleneck

Innovation

Methods, ideas, or system contributions that make the work stand out.

representation-conditioned diffusion

synthetic data generation

DINOv2