Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion

📅 2025-12-04
🤖 AI Summary
Existing latent diffusion models (LDMs) jointly synthesize semantic structure and texture, violating the natural “structure-first, texture-second” generation order—thereby limiting texture fidelity and sampling efficiency. This paper proposes Semantic-First Diffusion (SFD), the first framework to introduce asynchronous denoising of dual latent variables—semantic and textural—via temporal offset: the semantic latent converges earlier, providing a strong semantic anchor for subsequent texture generation. SFD constructs a compact semantic prior by integrating a pretrained vision encoder with a dedicated semantic VAE, and employs independent noise schedules for staged optimization. On ImageNet 256×256, SFD achieves a state-of-the-art FID of 1.04, accelerates training convergence by up to 100×, and serves as a plug-and-play enhancement for existing methods—including ReDi and VA-VAE—improving their performance without architectural modification.
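The key mechanism in the summary — semantic and texture latents denoised on separate schedules, with semantics running ahead by a temporal offset — can be sketched as follows. The function name, the linear offset rule, and the offset value are illustrative assumptions, not the paper's exact schedule.

```python
import numpy as np

def asynchronous_timesteps(t, offset=0.2):
    """Map a shared sampling time t in [0, 1] to separate timesteps for the
    semantic and texture latents. The semantic latent runs `offset` ahead,
    so it reaches t = 0 (fully denoised) earlier and can serve as a clean
    semantic anchor while the texture latent is still being refined."""
    t_semantic = np.clip(t - offset, 0.0, 1.0)  # semantics lead by `offset`
    t_texture = t                               # texture follows the base schedule
    return t_semantic, t_texture

# Midway through sampling, the semantic latent is already further along:
ts, tx = asynchronous_timesteps(0.5, offset=0.2)
```

With this kind of offset, the semantic branch hits its final, noise-free state while the texture branch still has `offset` worth of denoising left, which is what lets the converged semantics guide texture refinement.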

📝 Abstract
Latent Diffusion Models (LDMs) inherently follow a coarse-to-fine generation process, where high-level semantic structure is generated slightly earlier than fine-grained texture. This indicates that the preceding semantics can benefit texture generation by providing a semantic anchor. Recent advances have integrated semantic priors from pretrained visual encoders to further enhance LDMs, yet they still denoise the semantic and VAE-encoded texture latents synchronously, neglecting this ordering. Motivated by these observations, we propose Semantic-First Diffusion (SFD), a latent diffusion paradigm that explicitly prioritizes semantic formation. SFD first constructs composite latents by combining a compact semantic latent, extracted from a pretrained visual encoder via a dedicated Semantic VAE, with the texture latent. The core of SFD is to denoise the semantic and texture latents asynchronously using separate noise schedules: semantics precede textures by a temporal offset, providing clearer high-level guidance for texture refinement and enabling natural coarse-to-fine generation. On ImageNet 256×256 with guidance, SFD achieves FID 1.06 (LightningDiT-XL) and FID 1.04 (1.0B LightningDiT-XXL), while achieving up to 100× faster convergence than the original DiT. SFD also improves existing methods like ReDi and VA-VAE, demonstrating the effectiveness of asynchronous, semantics-led modeling. Project page and code: https://yuemingpan.github.io/SFD.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Existing LDMs denoise semantic structure and texture synchronously, ignoring the natural structure-first, texture-second generation order
Joint synthesis of semantics and texture limits texture fidelity and sampling efficiency
Semantic priors from pretrained encoders are underused when they are not given temporal priority over texture generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asynchronous denoising of semantic and texture latents
Semantic-first latent diffusion with temporal offset
Composite latents from pretrained visual encoder
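The composite-latent idea above — pairing a compact semantic code (pretrained encoder features compressed by a Semantic VAE) with the usual VAE texture latent — can be illustrated with toy shapes. The linear projection here is only a stand-in for the paper's learned Semantic VAE encoder, and all names and dimensions are assumptions for illustration.

```python
import numpy as np

def composite_latent(image_feats, texture_latent, proj):
    """Form a composite latent from a compact semantic code and a texture latent.

    image_feats:    (N, D)        pretrained vision-encoder features
    texture_latent: (N, C, H, W)  VAE-encoded texture latent
    proj:           (D, C_s)      placeholder for the Semantic VAE encoder
    Returns:        (N, C_s + C, H, W) channel-concatenated composite latent
    """
    sem = image_feats @ proj  # (N, C_s) compact semantic latent
    n, c, h, w = texture_latent.shape
    # Tile the semantic code over the spatial grid so it can be concatenated
    # with the texture latent along the channel axis.
    sem_map = np.broadcast_to(sem[:, :, None, None], (n, sem.shape[1], h, w))
    return np.concatenate([sem_map, texture_latent], axis=1)

# Toy example: 2 images, 768-d encoder features, 4x16x16 texture latents,
# semantics compressed to 8 channels.
z = composite_latent(np.zeros((2, 768)), np.zeros((2, 4, 16, 16)), np.zeros((768, 8)))
# z.shape == (2, 12, 16, 16)
```

In SFD, a composite latent of this shape is what the diffusion model denoises, with the semantic channels and texture channels following their own noise schedules.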
👥 Authors
Yueming Pan — IAIR, Xi'an Jiaotong University
Ruoyu Feng — University of Science and Technology of China (Generative Models, Computer Vision, Image/Video Coding for Machine)
Qi Dai — Microsoft Research Asia
Yuqi Wang — ByteDance
Wenfeng Lin — ByteDance
Mingyu Guo — ByteDance
Chong Luo — Microsoft Research (multimedia communications, computer vision)
Nanning Zheng — Xi'an Jiaotong University