🤖 AI Summary
Problem: Multi-modal Diffusion Transformers (MM-DiTs) degrade when generating rare or fine-grained textual semantics, because infrequent concepts leave only weak representations during pretraining.
Method: We propose a training-free variance-scaling technique, grounded in the joint-attention mechanism, that requires no additional data or compute. Before the text embeddings enter the joint-attention blocks, it selectively amplifies the variance of token representations, activating latent sparse semantic knowledge within the model and enhancing its responsiveness to low-frequency concepts during cross-modal alignment.
Contribution/Results: Evaluated on text-to-image, text-to-video, and text-guided image editing tasks, our method significantly improves generation accuracy, visual fidelity, and semantic consistency for rare concepts. This demonstrates that inference-time representation modulation alone, without architectural modification or additional training, can effectively unlock implicit semantic capabilities embedded in pretrained MM-DiTs.
📝 Abstract
Building on flow- and diffusion-based transformers, Multi-modal Diffusion Transformers (MM-DiTs) have reshaped text-to-vision generation, gaining acclaim for exceptional visual fidelity. As these models advance, users continually push the boundary with imaginative or rare prompts, on which even advanced models still falter, since their concepts are often too scarce to leave a strong imprint during pre-training. In this paper, we propose a simple yet effective intervention that surfaces rare semantics inside MM-DiTs without additional training steps, data, denoising-time optimization, or reliance on external modules (e.g., large language models). In particular, the joint-attention mechanism intrinsic to MM-DiT sequentially updates text embeddings alongside image embeddings throughout the transformer blocks. We find that by mathematically expanding the representational basins around text token embeddings via a variance scale-up before the joint-attention blocks, rare semantics clearly emerge in MM-DiT's outputs. Furthermore, our results generalize effectively across text-to-vision tasks, including text-to-image, text-to-video, and text-driven image editing. Our work invites generative models to reveal the semantics that users intend, once hidden yet ready to surface.
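As a minimal sketch of the core idea (not the paper's released code), the variance scale-up can be illustrated as rescaling the text token embeddings about their per-dimension mean by a factor greater than 1 before they enter the joint-attention blocks. The function name and the scale factor `gamma` here are illustrative assumptions:

```python
import numpy as np

def scale_token_variance(text_emb: np.ndarray, gamma: float = 1.5) -> np.ndarray:
    """Expand the spread of text token embeddings about their mean.

    text_emb: array of shape (num_tokens, dim), the prompt's token embeddings.
    gamma:    scale factor > 1; variance along the token axis grows by gamma**2.
    (Both the interface and the default gamma are assumptions for illustration.)
    """
    mean = text_emb.mean(axis=0, keepdims=True)        # per-dimension mean over tokens
    return mean + gamma * (text_emb - mean)            # recenter, scale, restore mean

# Example: the mean is preserved while the variance grows by gamma**2.
emb = np.random.default_rng(0).standard_normal((8, 4))
out = scale_token_variance(emb, gamma=2.0)
```

Because the operation only recenters and rescales the embeddings, it adds no trainable parameters and negligible compute, consistent with the training-free claim above.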