🤖 AI Summary
Problem: Multi-modal Diffusion Transformers (MM-DiTs) degrade when generating rare or fine-grained textual semantics, because infrequent concepts leave only weak representations during pretraining.
Method: We propose a training-free variance-scaling technique, grounded in the joint-attention mechanism, that requires no additional data or compute. Before the text embeddings enter the joint-attention blocks, it selectively amplifies the variance of token representations, activating latent sparse semantic knowledge within the model and enhancing its responsiveness to low-frequency concepts during cross-modal alignment.
Contribution/Results: Evaluated on text-to-image, text-to-video, and text-guided image editing tasks, our method significantly improves generation accuracy, visual fidelity, and semantic consistency for rare concepts. This demonstrates that inference-time representation modulation alone, without architectural modification or additional training, can effectively unlock implicit semantic capabilities embedded in pretrained MM-DiTs.
📝 Abstract
Building on flow- and diffusion-based transformers, Multi-modal Diffusion Transformers (MM-DiTs) have reshaped text-to-vision generation, gaining acclaim for exceptional visual fidelity. As these models advance, users continually push the boundary with imaginative or rare prompts, on which even advanced models still falter, since their concepts are often too scarce to leave a strong imprint during pre-training. In this paper, we propose a simple yet effective intervention that surfaces rare semantics inside MM-DiTs without additional training steps, data, denoising-time optimization, or reliance on external modules (e.g., large language models). In particular, the joint-attention mechanism intrinsic to MM-DiT sequentially updates text embeddings alongside image embeddings throughout the transformer blocks. We find that by mathematically expanding the representational basins around text token embeddings via a variance scale-up before the joint-attention blocks, rare semantics clearly emerge in MM-DiT's outputs. Furthermore, our results generalize effectively across text-to-vision tasks, including text-to-image, text-to-video, and text-driven image editing. Our work invites generative models to reveal the semantics that users intend, once hidden yet ready to surface.
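As a minimal sketch of the core idea (not the paper's released code), the variance scale-up can be illustrated as rescaling the text token embeddings about their per-dimension mean by a factor greater than 1 before they enter the joint-attention blocks. The function name and the scale factor `gamma` here are illustrative assumptions:

```python
import numpy as np

def scale_token_variance(text_emb: np.ndarray, gamma: float = 1.5) -> np.ndarray:
    """Expand the spread of text token embeddings about their mean.

    text_emb: array of shape (num_tokens, dim), the prompt's token embeddings.
    gamma:    scale factor > 1; variance along the token axis grows by gamma**2.
    (Both the interface and the default gamma are assumptions for illustration.)
    """
    mean = text_emb.mean(axis=0, keepdims=True)        # per-dimension mean over tokens
    return mean + gamma * (text_emb - mean)            # recenter, scale, restore mean

# Example: the mean is preserved while the variance grows by gamma**2.
emb = np.random.default_rng(0).standard_normal((8, 4))
out = scale_token_variance(emb, gamma=2.0)
```

Because the operation only recenters and rescales the embeddings, it adds no trainable parameters and negligible compute, consistent with the training-free claim above.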