IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance

πŸ“… 2025-09-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses imprecise multimodal alignment between generated images and text prompts in diffusion models. The authors propose Implicit Multimodal Guidance (IMG), a method that leverages multimodal large language models (MLLMs) to automatically detect image-text mismatches and iteratively refine the diffusion model's conditioning features via a trainable Implicit Aligner, without requiring external preference data, image editing, or additional training samples. The key contribution is an end-to-end, differentiable iterative preference optimization objective, enabling plug-and-play integration and compatibility with existing fine-tuning paradigms (e.g., DPO). Extensive experiments on SDXL, SDXL-DPO, and FLUX demonstrate that IMG improves text fidelity while preserving image quality, consistently outperforming state-of-the-art alignment methods across multiple benchmarks.

πŸ“ Abstract
Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works fine-tune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely the Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
Problem

Research questions and friction points this paper is trying to address.

Calibrating diffusion models for precise multimodal alignment
Addressing misalignments between generated images and prompts
Improving alignment without extra data or editing operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses MLLM to detect multimodal misalignments
Implicit Aligner manipulates diffusion conditioning features
Formulates trainable Iteratively Updated Preference Objective
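The three steps above (MLLM-based mismatch detection, conditioning-feature refinement by the Implicit Aligner, iterative re-generation) can be sketched as a minimal loop. This is an illustrative stand-in, not the paper's implementation: `mock_mllm_mismatch`, `ImplicitAligner`, and `img_loop` are hypothetical names, and the MLLM critic and diffusion model are replaced by toy numeric stand-ins.

```python
def mock_mllm_mismatch(features, target):
    """Stand-in for the MLLM critic: scores image-text misalignment.

    Here it is just the mean absolute gap between the current
    conditioning features and an idealized 'aligned' embedding.
    """
    return sum(abs(f - t) for f, t in zip(features, target)) / len(features)


class ImplicitAligner:
    """Toy stand-in for the trainable Implicit Aligner.

    The real module manipulates diffusion conditioning features; here
    it is modeled as a residual update toward the target with step lr.
    """

    def __init__(self, lr=0.5):
        self.lr = lr

    def refine(self, features, target):
        # Nudge each conditioning feature toward the aligned direction.
        return [f + self.lr * (t - f) for f, t in zip(features, target)]


def img_loop(features, target, max_iters=10, tol=0.05):
    """Iteratively refine conditioning features and re-generate until
    the (mock) MLLM reports alignment within tolerance."""
    aligner = ImplicitAligner()
    for _ in range(max_iters):
        if mock_mllm_mismatch(features, target) < tol:
            break
        features = aligner.refine(features, target)
    return features, mock_mllm_mismatch(features, target)


if __name__ == "__main__":
    prompt_target = [1.0, 0.0, 0.5]  # idealized aligned conditioning
    initial = [0.0, 1.0, 0.0]        # misaligned starting features
    refined, final_score = img_loop(initial, prompt_target)
    print(f"final mismatch: {final_score:.3f}")
```

In the actual method this loop is not run greedily at inference; the re-alignment goal is folded into the trainable Iteratively Updated Preference Objective, so the aligner is learned end-to-end and applied plug-and-play on top of models such as SDXL or FLUX.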
πŸ”Ž Similar Papers
No similar papers found.