🤖 AI Summary
This work identifies and systematically validates a critical issue in multimodal diffusion Transformers for text-to-image generation: as network depth increases, models progressively forget prompt semantics, degrading instruction-following performance. To address this, the authors propose a training-free prompt re-injection method that reintroduces text-prompt representations from early layers into deeper layers during denoising, thereby enhancing cross-layer semantic consistency. Grounded in linguistic-attribute probing of the text branch in SD3, SD3.5, and FLUX.1, the approach yields significant improvements in instruction adherence on the GenEval, DPG, and T2I-CompBench++ benchmarks, while also improving preference, aesthetic, and overall generation-quality metrics.
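The probing analysis mentioned above can be sketched as follows. This is a toy illustration, not the paper's protocol: the synthetic per-layer features merely mimic an attribute signal that fades with depth, whereas in practice one would extract real hidden states from the text branch of SD3, SD3.5, or FLUX.1 and probe a real linguistic attribute.

```python
# Illustrative layer-wise linguistic-attribute probing (assumed setup):
# fit a linear probe on frozen per-layer text-branch features and track
# how probe accuracy changes with depth.
import numpy as np

def probe_accuracy(feats, labels):
    """Fit a least-squares linear probe and return its training accuracy."""
    X = np.hstack([feats, np.ones((len(feats), 1))])   # append bias column
    targets = 2.0 * labels - 1.0                        # map {0,1} -> {-1,+1}
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    preds = (X @ w) > 0
    return float((preds == labels.astype(bool)).mean())

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)        # binary attribute, e.g. "prompt mentions a color"
signal = (2.0 * labels - 1.0)[:, None]       # attribute direction in feature space

accs = []
for depth in range(6):
    decay = 0.5 ** depth                     # synthetic stand-in for semantic forgetting
    feats = decay * signal + rng.normal(size=(200, 8))
    accs.append(probe_accuracy(feats, labels))
```

Under this synthetic decay, probe accuracy at the deepest layer falls well below that of the earliest layer, mirroring the forgetting trend the work reports on real models.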
📝 Abstract
Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch are progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations across layers of the text branch. Motivated by these findings, we introduce prompt reinjection, a training-free approach that reinjects prompt representations from early layers into later layers to alleviate the forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.
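The reinjection idea can be sketched with a simplified text branch. Everything below is an assumption for illustration: the layers are stand-in functions, and the cache layer index, reinjection layer set, and blend weight `alpha` are hypothetical parameters, not the paper's exact mechanism.

```python
# Minimal sketch of training-free prompt reinjection in a simplified
# MMDiT-like text branch: cache the text representation after an early
# layer and blend it back into selected later layers.
def run_text_branch(x, layers, reinject_from=2, reinject_at=(4, 5), alpha=0.5):
    """Run text tokens through `layers`, caching the representation after
    layer `reinject_from` and blending it into each layer in `reinject_at`
    to counteract prompt forgetting. `x` is a flat list of feature values."""
    cached = None
    for i, layer in enumerate(layers):
        x = layer(x)
        if i == reinject_from:
            cached = list(x)                 # early-layer prompt representation
        elif cached is not None and i in reinject_at:
            # convex blend of the current features with the cached early ones
            x = [(1 - alpha) * cur + alpha * old for cur, old in zip(x, cached)]
    return x
```

As a quick sanity check, with toy layers that simply attenuate the signal (mimicking forgetting), running with `reinject_at=()` yields a much weaker final representation than running with reinjection enabled.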