Aligning Text to Image in Diffusion Models is Easier Than You Think

📅 2025-03-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address insufficient semantic alignment in text-to-image diffusion models, this paper proposes a representation alignment (REPA)-based reconstruction training paradigm. Instead of relying solely on positive sample pairs for score matching, REPA incorporates a contrastive learning objective that jointly leverages both positive and negative sample pairs. We further design SoftREPA, a lightweight fine-tuning strategy that employs soft text tokens to achieve efficient cross-modal alignment; we theoretically prove that SoftREPA explicitly increases the mutual information between text and image representations. With fewer than 1 million additional parameters, our method significantly improves semantic consistency in both text-to-image generation and text-guided image editing tasks. Empirical evaluations demonstrate state-of-the-art performance across multiple benchmarks, outperforming existing alignment approaches.

Technology Category

Application Category

📝 Abstract
While recent advancements in generative modeling have significantly improved text-image alignment, some residual misalignment between text and image representations still remains. Although many approaches have attempted to address this issue by fine-tuning models using various reward models, etc., we revisit the challenge from the perspective of representation alignment-an approach that has gained popularity with the success of REPresentation Alignment (REPA). We first argue that conventional text-to-image (T2I) diffusion models, typically trained on paired image and text data (i.e., positive pairs) by minimizing score matching or flow matching losses, is suboptimal from the standpoint of representation alignment. Instead, a better alignment can be achieved through contrastive learning that leverages both positive and negative pairs. To achieve this efficiently even with pretrained models, we introduce a lightweight contrastive fine tuning strategy called SoftREPA that uses soft text tokens. This approach improves alignment with minimal computational overhead by adding fewer than 1M trainable parameters to the pretrained model. Our theoretical analysis demonstrates that our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency. Experimental results across text-to-image generation and text-guided image editing tasks validate the effectiveness of our approach in improving the semantic consistency of T2I generative models.
Problem

Research questions and friction points this paper is trying to address.

Improving text-image alignment in diffusion models
Addressing residual misalignment using contrastive learning
Enhancing semantic consistency with minimal computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastive learning with positive and negative pairs
Lightweight SoftREPA fine-tuning strategy
Enhances semantic consistency via mutual information
🔎 Similar Papers
No similar papers found.