Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Latent diffusion models struggle to generate raw pixels in an end-to-end manner: they lose information during encoding, rely on a separately trained decoder, and model an auxiliary distribution rather than the raw data. This work proposes Latent Forcing, a mechanism that jointly models latent representations and pixel space through a dual-path noise schedule and a reordered denoising trajectory, letting the latent variables serve as an intermediate cache before high-frequency detail synthesis. The approach operates directly on raw images while preserving the computational efficiency of latent diffusion. It also reveals the critical role of conditioning-signal timing in generation quality and provides a unified perspective linking tokenizer distillation, conditional generation, and diffusability. Evaluated on ImageNet under comparable computational budgets, it achieves state-of-the-art performance for pixel-level image generation with diffusion transformers.

📝 Abstract
Latent diffusion models excel at generating high-quality images but lose the benefits of end-to-end modeling. They discard information during image encoding, require a separately trained decoder, and model an auxiliary distribution to the raw data. In this paper, we propose Latent Forcing, a simple modification to existing architectures that achieves the efficiency of latent diffusion while operating on raw natural images. Our approach orders the denoising trajectory by jointly processing latents and pixels with separately tuned noise schedules. This allows the latents to act as a scratchpad for intermediate computation before high-frequency pixel features are generated. We find that the order of conditioning signals is critical, and we analyze this to explain differences between REPA distillation in the tokenizer and the diffusion model, conditional versus unconditional generation, and how tokenizer reconstruction quality relates to diffusability. Applied to ImageNet, Latent Forcing achieves a new state-of-the-art for diffusion transformer-based pixel generation at our compute scale.
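The abstract's core mechanism, jointly denoising latents and pixels with separately tuned noise schedules so that latents resolve first and act as a scratchpad, can be illustrated with a minimal sketch. This is not the paper's implementation; the `offset` parameter, the clipping, and the linear (flow-matching-style) noising are assumptions for illustration only.

```python
def shifted_noise_levels(t, offset=0.3):
    """Illustrative dual-path schedule for a global time t in [0, 1],
    where t = 1 is pure noise. The latent path runs ahead of the pixel
    path, so latents are mostly clean before high-frequency pixel
    detail is generated. `offset` (hypothetical) sets how far the
    latent schedule leads the pixel schedule."""
    latent_t = max(0.0, min(1.0, t - offset))  # latent path: less noisy
    pixel_t = max(0.0, min(1.0, t + offset))   # pixel path: more noisy
    return latent_t, pixel_t

def add_noise(x, eps, t):
    """Linear interpolation between data x and noise eps at level t."""
    return [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]
```

At mid-trajectory (t = 0.5 with offset 0.3), the latent path sits at noise level 0.2 while the pixel path sits at 0.8, matching the description of latents being largely resolved before pixel details appear.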
Problem

Research questions and friction points this paper is trying to address.

latent diffusion
pixel-space generation
image encoding
end-to-end modeling
diffusion trajectory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent Forcing
diffusion trajectory reordering
pixel-space generation
joint latent-pixel processing
noise schedule tuning
👥 Authors
Alan Baade
Department of Computer Science, Stanford University, California, USA
Eric Ryan Chan
Stanford University
Artificial Intelligence, Graphics, Computer Vision
Kyle Sargent
Department of Computer Science, Stanford University, California, USA
Changan Chen
Stanford University
computer vision, multimodal learning, embodied AI
Justin Johnson
Department of Computer Science and Engineering, University of Michigan, Michigan, USA
Ehsan Adeli
Stanford University
Computer Vision, Computational Neuroscience, Precision Healthcare, Ambient Intelligence
Li Fei-Fei
Professor of Computer Science, Stanford University
Artificial Intelligence, Machine Learning, Computer Vision, Neuroscience