🤖 AI Summary
MM-DiT architectures (e.g., FLUX) suffer from imprecise text–image alignment in text-to-image generation, primarily due to: (1) severe token-count imbalance between visual and textual sequences, which hinders effective cross-modal attention; and (2) timestep-agnostic attention weights, which limit adaptability to the evolving semantic requirements of the diffusion process. To address this, the authors propose Temperature-Adjusted Cross-modal Attention (TACA), the first method to introduce a timestep-aware, dynamically scaled temperature mechanism within diffusion model attention layers. TACA enables parameter-efficient, timestep-sensitive rebalancing of cross-modal interactions without architectural modifications. Integrated with LoRA fine-tuning, it is fully compatible with state-of-the-art models including FLUX and SD3.5. On T2I-CompBench, TACA significantly improves accuracy in object generation, attribute binding, and spatial relation modeling, while incurring negligible computational overhead.
📝 Abstract
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, both of which hinder alignment. To address these issues, we propose **Temperature-Adjusted Cross-modal Attention (TACA)**, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code is publicly available at https://github.com/Vchitect/TACA
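To make the idea concrete, here is a minimal sketch of temperature-adjusted cross-modal attention in the joint-attention setting MM-DiT uses (text and image tokens in one sequence). The temperature factor `gamma`, the timestep threshold `tau`, and the step-function schedule are all illustrative assumptions — the paper's exact scaling rule and values may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def taca_attention(q, k, v, n_txt, t, gamma=1.5, tau=0.5):
    """Joint attention over [text; image] tokens with a temperature
    boost on image-query -> text-key logits.

    q, k, v : (n_tokens, d) arrays; the first n_txt rows are text tokens.
    t       : normalized diffusion timestep in [0, 1] (1 = pure noise).
    gamma   : cross-modal temperature scale (hypothetical value).
    tau     : timestep threshold -- a hypothetical schedule that boosts
              cross-modal attention only at high-noise steps (t > tau).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    scale = gamma if t > tau else 1.0
    # Amplify only the cross-modal block: image queries attending to
    # text keys, counteracting the visual/textual token imbalance.
    logits[n_txt:, :n_txt] *= scale
    return softmax(logits, axis=-1) @ v
```

Because only the attention logits are rescaled, this drop-in change leaves the model architecture and parameter count untouched, which is what makes it compatible with LoRA fine-tuning of existing FLUX or SD3.5 checkpoints.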