AI Summary
This work addresses the problem of imprecise multimodal alignment between images and text prompts in diffusion models. To this end, we propose Implicit Multimodal Guidance (IMG), a method that leverages multimodal large language models (MLLMs) to automatically detect image-text mismatches and iteratively refine the diffusion model's conditional features via a trainable implicit aligner, without requiring external preference data, image editing, or additional training samples. Our key contribution is the first formulation of an end-to-end, differentiable iterative preference optimization objective, enabling plug-and-play integration and compatibility with existing fine-tuning paradigms (e.g., DPO). Extensive experiments on SDXL, SDXL-DPO, and FLUX demonstrate that IMG significantly improves text fidelity while preserving image quality, consistently outperforming state-of-the-art alignment methods across multiple benchmarks.
Abstract
Ensuring precise multimodal alignment between diffusion-generated images and input prompts has been a long-standing challenge. Earlier works fine-tune diffusion weights using high-quality preference data, which tends to be limited and difficult to scale up. Recent editing-based methods further refine local regions of generated images but may compromise overall image quality. In this work, we propose Implicit Multimodal Guidance (IMG), a novel re-generation-based multimodal alignment framework that requires no extra data or editing operations. Specifically, given a generated image and its prompt, IMG a) utilizes a multimodal large language model (MLLM) to identify misalignments; b) introduces an Implicit Aligner that manipulates diffusion conditioning features to reduce misalignments and enable re-generation; and c) formulates the re-alignment goal into a trainable objective, namely the Iteratively Updated Preference Objective. Extensive qualitative and quantitative evaluations on SDXL, SDXL-DPO, and FLUX show that IMG outperforms existing alignment methods. Furthermore, IMG acts as a flexible plug-and-play adapter, seamlessly enhancing prior finetuning-based alignment methods. Our code will be available at https://github.com/SHI-Labs/IMG-Multimodal-Diffusion-Alignment.
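The generate / detect-misalignment / refine-conditioning loop described in steps a)-c) can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: `detect_misalignments`, `implicit_aligner`, and `generate` are hypothetical stand-ins (the real system uses an MLLM, a trainable aligner over diffusion conditioning features, and a diffusion model), and concepts are modeled as simple string sets.

```python
# Toy sketch of the IMG re-generation loop. All names below are
# illustrative stand-ins, NOT the actual IMG API.

def detect_misalignments(image, prompt):
    # Stand-in for the MLLM check: report prompt concepts missing
    # from the generated image.
    return sorted(set(prompt) - set(image))

def implicit_aligner(cond_features, misalignments):
    # Stand-in for the trainable Implicit Aligner: nudge the
    # conditioning toward the missing concepts (here, just append).
    return cond_features + misalignments

def generate(cond_features):
    # Stand-in diffusion model: "renders" whatever is conditioned on.
    return list(cond_features)

def img_align(prompt, cond_features, max_iters=3):
    """Iterate: generate -> detect mismatches -> refine conditioning."""
    image = generate(cond_features)
    for _ in range(max_iters):
        misses = detect_misalignments(image, prompt)
        if not misses:
            break  # image already aligned with the prompt
        cond_features = implicit_aligner(cond_features, misses)
        image = generate(cond_features)
    return image

prompt = ["cat", "red hat", "beach"]
initial_cond = ["cat", "beach"]  # conditioning initially misses "red hat"
final = img_align(prompt, initial_cond)
print(sorted(final))  # -> ['beach', 'cat', 'red hat']
```

In the actual method this loop is made differentiable, so the aligner can be trained end-to-end with the Iteratively Updated Preference Objective rather than applied as a heuristic post-hoc fix.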