Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the semantic drift and quality degradation commonly observed in multi-turn image editing with Diffusion Transformers (DiTs). Analyzing the VAE latent space from a frequency-domain perspective, the study identifies cumulative low-frequency drift introduced by the DiT as the primary cause of semantic misalignment. To mitigate this, the authors propose a plug-and-play alignment method that requires no training, ground-truth priors, or model parameter modifications. The method combines low-pass filtering, frequency-domain decomposition, and an exponential moving average to correct low-frequency deviations in the latent space at each editing round. Because it needs no access to diffusion parameters, it is compatible with both white-box and black-box editors. Experiments show improved semantic consistency and visual fidelity across diverse multi-turn editing scenarios, with high-frequency details preserved.

📝 Abstract

Recent advances in diffusion transformers (DiTs) have enabled promising single-turn image editing capabilities. However, multi-turn editing often leads to progressive semantic drift and quality degradation. In this work, we study this problem from a latent-space frequency perspective by decomposing the editing process into two functional components: VAE and DiT. Through systematic analysis in the VAE latent space, we uncover that the DiT introduces dominant low-frequency drift that accumulates as semantic misalignment across editing rounds, while the VAE contributes comparatively stable reconstruction bias. Based on this insight, we propose VAE-LFA (Low Frequency Alignment), a training-free, plug-and-play method that performs alignment in VAE latent space. VAE-LFA decomposes latent discrepancies across editing rounds via low-pass filtering, and aligns low-frequency statistics to an exponential moving average of previous rounds, effectively suppressing accumulated semantic drift while preserving high-frequency details. Our method requires no retraining, ground-truth priors, or access to diffusion parameters, making it applicable to both white-box and black-box DiT editors. For white-box models, VAE-LFA is seamlessly integrated into the editing pipeline by eliminating redundant VAE round trips; for black-box models, it operates via an off-the-shelf VAE to perform inter-round latent alignment. Extensive experiments demonstrate that VAE-LFA improves semantic consistency and visual fidelity across diverse multi-turn editing scenarios, including both controlled and in-the-wild images.
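
The alignment step described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the abstract does not specify the filter, the statistics, or the update rule, so everything concrete here is an assumption. In particular, the ideal radial FFT mask, the choice of per-channel mean and standard deviation as the "low-frequency statistics", the `cutoff` and `decay` values, and the names `low_pass` and `LowFreqAligner` are all hypothetical.

```python
import numpy as np


def low_pass(z, cutoff=0.25):
    """Ideal radial low-pass filter over the spatial axes of a (C, H, W) latent.

    `cutoff` is a fraction of the Nyquist frequency (an assumed parameterization).
    """
    _, h, w = z.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequency, cycles per sample
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequency, cycles per sample
    mask = np.sqrt(fy**2 + fx**2) <= cutoff * 0.5
    spec = np.fft.fft2(z, axes=(-2, -1))
    return np.fft.ifft2(spec * mask, axes=(-2, -1)).real


class LowFreqAligner:
    """Aligns per-channel low-frequency statistics to an EMA of earlier rounds.

    Assumption: the statistics are the per-channel mean and standard deviation
    of the low-pass component; the paper may use different ones.
    """

    def __init__(self, cutoff=0.25, decay=0.9, eps=1e-8):
        self.cutoff = cutoff
        self.decay = decay
        self.eps = eps
        self.ema_mean = None
        self.ema_std = None

    def __call__(self, z):
        low = low_pass(z, self.cutoff)
        high = z - low  # high-frequency residual, left untouched
        mean = low.mean(axis=(1, 2), keepdims=True)
        std = low.std(axis=(1, 2), keepdims=True)

        if self.ema_mean is None:
            # First round: nothing to align against, just record statistics.
            self.ema_mean, self.ema_std = mean, std
            return z

        # Re-standardize the low band to the running reference statistics.
        low_aligned = (low - mean) / (std + self.eps) * self.ema_std + self.ema_mean

        # Assumed update rule: fold this round's raw statistics into the EMA.
        self.ema_mean = self.decay * self.ema_mean + (1 - self.decay) * mean
        self.ema_std = self.decay * self.ema_std + (1 - self.decay) * std

        return low_aligned + high


# Usage: a constant offset stands in for low-frequency drift between rounds.
rng = np.random.default_rng(0)
z_round0 = rng.standard_normal((4, 16, 16))
aligner = LowFreqAligner()
aligner(z_round0)                    # round 0 initializes the EMA
z_round1 = z_round0 + 0.5            # simulated drift
z_fixed = aligner(z_round1)          # low band pulled back, high band unchanged
```

In the black-box setting the abstract describes, the input latent would come from encoding each edited image with an off-the-shelf VAE, and the aligned latent would be decoded before the next editing round. Those encode and decode calls are omitted here because they depend on the specific VAE.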

Problem

Research questions and friction points this paper is trying to address.

semantic drift

multi-turn editing

diffusion transformers

VAE latent space

quality degradation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Low Frequency Alignment

VAE Latent Space

Diffusion Transformer