AI Summary
To address the misalignment between user intent and generated images in text-to-image (T2I) synthesis, this paper proposes a self-supervised alignment method grounded in mutual information (MI) maximization. For the first time, pointwise MI estimation is directly integrated into the denoising process of diffusion models, leveraging only internal features of a pretrained denoising network, without requiring auxiliary vision-language models, human annotations, or external multimodal supervision. The approach employs self-supervised fine-tuning coupled with synthetically constructed fine-tuning datasets to achieve fine-grained cross-modal alignment between text and image representations. Experiments demonstrate that our method significantly improves textual alignment fidelity while preserving high visual quality of generated images, outperforming state-of-the-art approaches. Moreover, it exhibits strong generalization, computational efficiency, and parameter-light design.
Abstract
Diffusion models for Text-to-Image (T2I) conditional generation have recently achieved tremendous success. Yet aligning these models with users' intentions still involves a laborious trial-and-error process, and this challenging alignment problem has attracted considerable attention from the research community. In this work, instead of relying on fine-grained linguistic analyses of prompts, human annotation, or auxiliary vision-language models, we use Mutual Information (MI) to guide model alignment. In brief, our method uses self-supervised fine-tuning and relies on a point-wise MI estimate between prompts and images to create a synthetic fine-tuning set for improving model alignment. Our analysis indicates that our method is superior to the state of the art, yet it requires only the pre-trained denoising network of the T2I model itself to estimate MI, together with a simple fine-tuning strategy that improves alignment while maintaining image quality.
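As a rough illustration of how a point-wise MI score could drive the construction of a synthetic fine-tuning set, the sketch below scores each (prompt, image) candidate with the standard definition i(x; y) = log p(x | y) - log p(x) and keeps the highest-scoring candidates per prompt. The abstract does not specify the estimator or the selection rule, so the `Candidate` fields, the log-probability values, the top-k selection, and the function names are all assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    prompt: str
    image_id: str        # placeholder handle for a generated image
    log_p_cond: float    # hypothetical estimate of log p(image | prompt)
    log_p_uncond: float  # hypothetical estimate of log p(image)


def pointwise_mi(c: Candidate) -> float:
    # Standard point-wise MI: i(x; y) = log p(x | y) - log p(x).
    return c.log_p_cond - c.log_p_uncond


def build_finetuning_set(candidates: List[Candidate], k: int = 1) -> List[Candidate]:
    # Assumed selection rule: group by prompt, keep the k highest-MI images.
    by_prompt: Dict[str, List[Candidate]] = {}
    for c in candidates:
        by_prompt.setdefault(c.prompt, []).append(c)
    selected: List[Candidate] = []
    for group in by_prompt.values():
        selected.extend(sorted(group, key=pointwise_mi, reverse=True)[:k])
    return selected
```

In this sketch the two log-probability fields stand in for whatever quantities the pre-trained denoising network would supply; the selected pairs would then serve as the data for the self-supervised fine-tuning step.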