Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the semantic drift and quality degradation commonly observed in multi-turn image editing with Diffusion Transformers (DiTs). Analyzing the editing process in the frequency domain of the VAE latent space, the study identifies low-frequency drift introduced by the DiT, accumulating across editing rounds, as the primary cause of semantic misalignment. To mitigate this, the authors propose a plug-and-play alignment method that requires no training, ground-truth priors, or access to the diffusion model's parameters. The method separates latent discrepancies into low- and high-frequency components via low-pass filtering and aligns the low-frequency statistics to an exponential moving average of previous rounds, which corrects low-frequency deviations while preserving high-frequency details. Because it operates only on latents, it applies to both white-box and black-box editors. Experiments show improved semantic consistency and visual fidelity across diverse multi-turn editing scenarios.
📝 Abstract
Recent advances in diffusion transformers (DiTs) have enabled promising single-turn image editing capabilities. However, multi-turn editing often leads to progressive semantic drift and quality degradation. In this work, we study this problem from a latent-space frequency perspective by decomposing the editing process into two functional components: VAE and DiT. Through systematic analysis in the VAE latent space, we uncover that the DiT introduces dominant low-frequency drift that accumulates as semantic misalignment across editing rounds, while the VAE contributes comparatively stable reconstruction bias. Based on this insight, we propose VAE-LFA (Low Frequency Alignment), a training-free, plug-and-play method that performs alignment in VAE latent space. VAE-LFA decomposes latent discrepancies across editing rounds via low-pass filtering, and aligns low-frequency statistics to an exponential moving average of previous rounds, effectively suppressing accumulated semantic drift while preserving high-frequency details. Our method requires no retraining, ground-truth priors, or access to diffusion parameters, making it applicable to both white-box and black-box DiT editors. For white-box models, VAE-LFA is seamlessly integrated into the editing pipeline by eliminating redundant VAE round trips; for black-box models, it operates via an off-the-shelf VAE to perform inter-round latent alignment. Extensive experiments demonstrate that VAE-LFA improves semantic consistency and visual fidelity across diverse multi-turn editing scenarios, including both controlled and in-the-wild images.
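
To make the alignment idea in the abstract concrete, here is a minimal NumPy sketch of one way such a step could look. It is not the authors' implementation. The abstract does not specify the filter, which statistics are aligned, or the EMA decay, so everything below is an assumption for illustration: a Gaussian low-pass filter applied in the FFT domain, per-channel mean and standard deviation as the "low-frequency statistics", a decay of 0.9, and the hypothetical names `low_pass` and `LowFreqAligner`. Latents are assumed to be arrays of shape (C, H, W).

```python
import numpy as np


def low_pass(z: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Gaussian low-pass over the spatial axes of a (C, H, W) latent (assumed filter)."""
    _, h, w = z.shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies, shape (H, 1)
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies, shape (1, W)
    mask = np.exp(-(fx**2 + fy**2) / (2.0 * sigma**2))  # equals 1 at DC
    spectrum = np.fft.fft2(z, axes=(-2, -1)) * mask
    return np.fft.ifft2(spectrum, axes=(-2, -1)).real


class LowFreqAligner:
    """Hypothetical helper: aligns low-frequency statistics to an EMA of earlier rounds."""

    def __init__(self, decay: float = 0.9, sigma: float = 0.1, eps: float = 1e-6):
        self.decay, self.sigma, self.eps = decay, sigma, eps
        self.mean = None  # EMA of per-channel low-frequency mean
        self.std = None   # EMA of per-channel low-frequency std

    def __call__(self, z: np.ndarray) -> np.ndarray:
        low = low_pass(z, self.sigma)
        high = z - low  # high-frequency residual, added back unchanged
        mean = low.mean(axis=(1, 2), keepdims=True)
        std = low.std(axis=(1, 2), keepdims=True)

        if self.mean is None:
            # First round: nothing to align to, so it only seeds the EMA.
            self.mean, self.std = mean, std
            return z.copy()

        # Re-standardize the low band to the EMA statistics of previous rounds.
        aligned_low = (low - mean) / (std + self.eps) * self.std + self.mean

        # Assumption: the EMA is updated with the observed (pre-alignment) stats.
        self.mean = self.decay * self.mean + (1.0 - self.decay) * mean
        self.std = self.decay * self.std + (1.0 - self.decay) * std
        return aligned_low + high


# Usage: a constant offset stands in for low-frequency drift in round 2.
rng = np.random.default_rng(0)
z_round1 = rng.standard_normal((4, 16, 16))
aligner = LowFreqAligner()
aligner(z_round1)                    # seeds the running statistics
corrected = aligner(z_round1 + 0.5)  # offset is pulled back toward round 1
```

In a black-box setting as described in the abstract, the latent passed to such a helper would come from encoding each round's output image with an off-the-shelf VAE; that encode/decode step is omitted here because it depends on the specific VAE used.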
Problem

Research questions and friction points this paper is trying to address.

semantic drift
multi-turn editing
diffusion transformers
VAE latent space
quality degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Low Frequency Alignment
VAE Latent Space
Diffusion Transformer
Semantic Drift
Plug-and-Play Editing