Autoguided Online Data Curation for Diffusion Model Training

📅 2025-09-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the time and sample efficiency bottlenecks in training generative diffusion models. We propose a unified framework integrating autoguidance and Joint Example Selection for Training (JEST), enabling online data filtering and dynamic optimization during training. Through controlled experiments on 2D synthetic data and 64×64 image generation, we systematically evaluate how different data selection strategies affect generation quality and diversity. Results show that autoguidance consistently improves both sample fidelity and diversity; early-stage AJEST achieves comparable or slightly better data efficiency than autoguidance but suffers from high computational overhead, limiting practical deployment. Our key contribution is the first empirical delineation of the stage-dependent effectiveness boundaries between autoguidance and JEST, revealing that lightweight autoguidance dominates across most training phases in terms of both performance and deployment feasibility.
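The autoguidance idea referenced above guides a diffusion model's denoiser with a weaker version of itself (e.g. a less-trained or smaller checkpoint) and extrapolates away from the weak model's prediction. A minimal sketch, assuming hypothetical denoiser callables `d_main` and `d_weak` (the function name and signature are illustrative, not from the paper's code):

```python
import numpy as np

def autoguided_denoise(d_main, d_weak, x, sigma, w=2.0):
    """Autoguidance sketch: extrapolate the main model's denoised
    estimate away from that of a weaker guiding model.
    w > 1 strengthens guidance; w = 1 recovers the main model alone."""
    out_main = d_main(x, sigma)  # denoised estimate from the main model
    out_weak = d_weak(x, sigma)  # estimate from the degraded guiding model
    return out_weak + w * (out_main - out_weak)
```

With `w = 1` the weak-model terms cancel, which makes the guidance strength easy to ablate against an unguided baseline.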

๐Ÿ“ Abstract
The rising compute costs of generative models have rekindled hopes for efficient data curation. In this work, we investigate whether recently developed autoguidance and online data selection methods can improve the time and sample efficiency of training generative diffusion models. We integrate joint example selection (JEST) and autoguidance into a unified code base for fast ablation and benchmarking. We evaluate combinations of data curation methods on a controlled 2-D synthetic data generation task as well as (3×64×64)-D image generation. Our comparisons are made at equal wall-clock time and equal number of samples, explicitly accounting for the overhead of selection. Across experiments, autoguidance consistently improves sample quality and diversity. Early AJEST (applying selection only at the beginning of training) can match or modestly exceed autoguidance alone in data efficiency on both tasks. However, its time overhead and added complexity make autoguidance or uniform random data selection preferable in most situations. These findings suggest that while targeted online selection can yield efficiency gains in early training, robust sample quality improvements are primarily driven by autoguidance. We discuss limitations and scope, and outline when data selection may be beneficial.
Problem

Research questions and friction points this paper is trying to address.

Improving time and sample efficiency in diffusion model training
Evaluating autoguidance and online data selection methods
Comparing data curation strategies for generative model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoguidance improves sample quality and diversity
Early AJEST selection enhances data efficiency
JEST and autoguidance integrated in a unified benchmarking code base
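JEST-style online selection scores candidate examples by their "learnability": high loss under the learner but low loss under a reference model. A minimal per-example sketch (the real JEST algorithm samples correlated sub-batches jointly; the function name and top-k simplification here are illustrative assumptions):

```python
import numpy as np

def jest_select(loss_learner, loss_reference, n_select):
    """JEST-style scoring sketch: rank examples by learnability
    (learner loss minus reference loss) and keep the top n_select.
    Examples the reference already fits well but the learner does
    not are considered most worth training on."""
    learnability = loss_learner - loss_reference
    # indices of the n_select most learnable examples, best first
    return np.argsort(learnability)[::-1][:n_select]
```

This scoring step is also where the selection overhead discussed above comes from: each candidate batch requires extra forward passes through both models before any training step is taken.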