JLT: Clean-Latent Prediction in Latent Diffusion Transformers

📅 2026-05-26
🤖 AI Summary
This work examines how the choice of prediction target in a latent diffusion model affects how well the model exploits low-dimensional structure in a compressed latent space. The authors propose JLT, a 130M-parameter latent diffusion Transformer that operates on codes from a frozen FLUX.2 VAE and directly predicts the clean latent rather than noise or velocity. They compare clean-latent prediction against a velocity-prediction DiT matched in representation, backbone, and training settings. Combining flow matching, a DiT architecture, and classifier-free guidance, JLT-B/1 reaches an FID-50K of 2.50 on ImageNet 256×256, with a large gap over the matched velocity-prediction baseline. The results suggest that the prediction target is a geometric choice that depends on the structure of the latent representation, rather than an interchangeable algebraic parameterization.
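To make the phrase "interchangeable algebraic parameterization" concrete, here is a minimal sketch of how a clean latent, a noise sample, and a velocity can be converted into one another at a fixed time. The corruption convention `z_t = (1 - t) * x + t * eps` with `v = eps - x` is an assumption (a common flow-matching choice); the listing does not state the paper's convention, and all function names are hypothetical.

```python
import numpy as np

# ASSUMPTION (not stated in the source): linear interpolation path
#   z_t = (1 - t) * x + t * eps,   v = eps - x,   0 < t < 1.
# Function names are illustrative, not taken from the JLT code.

def corrupt(x, eps, t):
    """Mix a clean latent x with noise eps at time t."""
    return (1.0 - t) * x + t * eps

def x_from_v(z_t, v, t):
    """Recover the clean latent from a velocity value."""
    return z_t - t * v

def eps_from_v(z_t, v, t):
    """Recover the noise sample from a velocity value."""
    return z_t + (1.0 - t) * v

def v_from_x(z_t, x, t):
    """Recover the velocity from a clean-latent value (requires t > 0)."""
    return (z_t - x) / t

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))    # stand-in for clean latent codes
eps = rng.normal(size=(4, 8))  # Gaussian noise
t = 0.3
z_t = corrupt(x, eps, t)
v = eps - x

print(np.allclose(x_from_v(z_t, v, t), x))
print(np.allclose(eps_from_v(z_t, v, t), eps))
print(np.allclose(v_from_x(z_t, x, t), v))
```

These identities hold exactly for any fixed `t`, which is why the targets look equivalent on paper. They say nothing about which target is easier to regress, and that is the question the work studies.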
📝 Abstract
Flow matching with clean-data prediction has shown that regressing the clean point can exploit low-dimensional structure more effectively than predicting an ambient noised quantity. We ask whether this principle remains useful after images are mapped into a learned latent space, where compression has already removed much of the raw pixel variability. We introduce JLT, a 130M latent diffusion Transformer over frozen FLUX.2 VAE codes, and compare clean-latent prediction with a matched velocity-prediction DiT under the same representation, backbone, and training settings. Although the three variables x, epsilon, and v are linearly convertible for a fixed corruption time, a local Gaussian analysis shows that velocity regression inherits an isotropic target-covariance floor and amplifies low-variance latent directions, while clean prediction damps them. On ImageNet 256×256, JLT-B/1 obtains FID-50K 2.50 with classifier-free guidance, with a large matched-target gap over velocity prediction. These results suggest that prediction targets in latent diffusion are representation-dependent geometric choices, rather than interchangeable algebraic parameterizations.
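The Gaussian argument in the abstract can be illustrated with a one-dimensional toy model per latent direction. This is a sketch under stated assumptions, not the paper's analysis: each direction of the clean latent is taken to be `x ~ N(0, lam)`, the noise is `eps ~ N(0, 1)`, and the corruption is assumed to be `z = (1 - t) * x + t * eps` with `v = eps - x`. The helper names are hypothetical.

```python
import numpy as np

# ASSUMPTIONS (illustrative, not from the source): per-direction model with
#   x ~ N(0, lam), eps ~ N(0, 1), z = (1 - t) * x + t * eps, v = eps - x.

def target_variances(lam):
    """Variance of each regression target along a direction with variance lam."""
    var_x = lam        # clean target: can be arbitrarily small
    var_v = lam + 1.0  # velocity target: never below the unit noise variance
    return var_x, var_v

def optimal_gains(lam, t):
    """Gains g such that the optimal linear predictor of each target is g * z."""
    var_z = (1.0 - t) ** 2 * lam + t ** 2
    gain_x = (1.0 - t) * lam / var_z  # Cov(x, z) / Var(z)
    gain_eps = t / var_z              # Cov(eps, z) / Var(z)
    gain_v = gain_eps - gain_x        # linearity: v = eps - x
    return gain_x, gain_v

lam = np.array([1e-4, 1e-2, 1.0])     # low- to high-variance directions
t = 0.1
var_x, var_v = target_variances(lam)
gain_x, gain_v = optimal_gains(lam, t)

for l, vx, vv, gx, gv in zip(lam, var_x, var_v, gain_x, gain_v):
    print(f"lam={l:g}  var_x={vx:.4g}  var_v={vv:.4g}  "
          f"gain_x={gx:.4g}  gain_v={gv:.4g}")
```

In this toy model, as `lam` approaches 0 the clean-prediction gain goes to 0 while the velocity gain approaches `1 / t`, which is large at small `t`. The velocity target variance also stays at or above 1 in every direction. Both effects are consistent with the "isotropic floor" and "amplifies versus damps" language in the abstract.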
Problem

Research questions and friction points this paper is trying to address.

latent diffusion
clean-latent prediction
flow matching
representation dependence
diffusion targets
Innovation

Methods, ideas, or system contributions that make the work stand out.

clean-latent prediction
latent diffusion
flow matching
Transformer
representation-dependent geometry