Asynchronous Denoising Diffusion Models for Aligning Text-to-Image Generation

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion models generate high-fidelity images but suffer from suboptimal text–image alignment, primarily due to synchronous denoising—where all pixels share identical timesteps, depriving semantically critical regions of precise contextual guidance. To address this, we propose the Asynchronous Denoising Diffusion Model (ADDM), the first diffusion framework to introduce pixel-level dynamic timestep scheduling: it adaptively assigns distinct denoising rates across spatial locations based on textual relevance, enabling more gradual convergence in semantically critical regions and thereby enhancing alignment via richer local and global context. ADDM comprises three core components: pixel-wise timestep modulation, a learnable dynamic scheduling mechanism, and a context-aware training strategy. Extensive evaluations across multiple benchmarks demonstrate that ADDM significantly improves semantic consistency—as measured by CLIP-Score and human evaluation—while preserving image fidelity.

📝 Abstract
Diffusion models have achieved impressive results in generating high-quality images. Yet, they often struggle to faithfully align the generated images with the input prompts. This limitation arises from synchronous denoising, where all pixels simultaneously evolve from random noise to clear images. As a result, during generation, the prompt-related regions can only reference the unrelated regions at the same noise level, failing to obtain clear context and ultimately impairing text-to-image alignment. To address this issue, we propose asynchronous diffusion models -- a novel framework that allocates distinct timesteps to different pixels and reformulates the pixel-wise denoising process. By dynamically modulating the timestep schedules of individual pixels, prompt-related regions are denoised more gradually than unrelated regions, thereby allowing them to leverage clearer inter-pixel context. Consequently, these prompt-related regions achieve better alignment in the final images. Extensive experiments demonstrate that our asynchronous diffusion models can significantly improve text-to-image alignment across diverse prompts. The code repository for this work is available at https://github.com/hu-zijing/AsynDM.
Problem

Research questions and friction points this paper is trying to address.

Improving text-to-image alignment in diffusion model generation
Addressing limitations of synchronous pixel denoising processes
Enhancing prompt-related region development using clearer context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous diffusion models allocate distinct timesteps to pixels
Dynamic modulation of timestep schedules for gradual denoising
Improved text-to-image alignment through clearer inter-pixel context
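The idea sketched in these bullets can be illustrated with a toy per-pixel timestep schedule. This is a hypothetical sketch, not the authors' implementation: it assumes a standard DDPM noise schedule, a [0,1] text-relevance mask, a made-up linear-lag rule for how much prompt-related pixels trail, and a DDIM-style (eta=0) update applied with per-pixel timesteps.

```python
import numpy as np

# Standard DDPM noise schedule (linear betas); indexing alpha_bar with a
# per-pixel timestep map is the only change from the synchronous case.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def pixel_timestep_map(relevance, step, total_steps):
    """Map a [0,1] text-relevance mask to per-pixel timesteps.

    Prompt-related pixels (relevance ~ 1) are held at higher (noisier)
    timesteps, i.e. denoised more gradually, so that by the time they
    resolve, unrelated pixels already provide clear context.
    The 20% lag is a hypothetical choice for illustration.
    """
    frac = step / total_steps                       # global progress in [0, 1]
    lag = 0.2 * relevance                           # relevant pixels trail by up to 20%
    local_frac = np.clip(frac - lag, 0.0, 1.0)      # per-pixel progress
    return ((1.0 - local_frac) * (T - 1)).astype(int)

def async_denoise_step(x_t, eps_hat, t_map):
    """One DDIM-style (eta=0) update, evaluated with per-pixel timesteps."""
    ab_t = alpha_bar[t_map]
    ab_prev = alpha_bar[np.maximum(t_map - 1, 0)]
    x0_hat = (x_t - np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps_hat

# Toy usage: an 8x8 image whose left half is prompt-related.
relevance = np.zeros((8, 8))
relevance[:, :4] = 1.0
t_map = pixel_timestep_map(relevance, step=500, total_steps=1000)
# Prompt-related pixels sit at a higher timestep than unrelated ones.
assert t_map[0, 0] > t_map[0, 7]
```

In a real sampler, `eps_hat` would come from the noise-prediction network and the relevance mask from cross-attention between the prompt and the image; both are stubbed out here.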
Zijing Hu, Zhejiang University
Yunze Tong, Zhejiang University
Fengda Zhang, Nanyang Technological University
Junkun Yuan, Research Scientist, Tencent (Computer Vision, Multimodal AI, Generative AI)
Jun Xiao, Zhejiang University
Kun Kuang, Zhejiang University (Causal Inference, Data Mining, Machine Learning)