🤖 AI Summary
This work proposes a unified framework for aligning diffusion and flow-matching models with reward signals, avoiding the high computational cost and limited generalization of conventional alignment methods that fully fine-tune pretrained generative models. The approach adopts a reward-weighted distribution perspective: diffusion models are aligned via score guidance and flow-matching models via velocity guidance, in both cases estimating a conditional expectation of the reward rather than fine-tuning the generative model end to end. For diffusion models, a lightweight guidance network enables one-step generation that matches fine-tuned baselines while reducing computational cost by at least 60%. For flow-matching models, a training-free variant improves generation quality at no additional cost.
📝 Abstract
Diffusion models and flow matching have demonstrated remarkable success in text-to-image generation. While many existing alignment methods primarily focus on fine-tuning pre-trained generative models to maximize a given reward function, these approaches require extensive computational resources and may not generalize well across different objectives. In this work, we propose a novel alignment framework by leveraging the underlying nature of the alignment problem -- sampling from reward-weighted distributions -- and show that it applies to both diffusion models (via score guidance) and flow matching models (via velocity guidance). The score function (velocity field) required for the reward-weighted distribution can be decomposed into the pre-trained score (velocity field) plus a conditional expectation of the reward. For alignment of diffusion models, we identify a fundamental challenge: the adversarial nature of the guidance term can introduce undesirable artifacts in the generated images. We therefore propose a fine-tuning-free framework that trains a guidance network to estimate the conditional expectation of the reward. We achieve performance comparable to fine-tuning-based models using one-step generation, with at least a 60% reduction in computational cost. For alignment of flow matching models, we propose a training-free framework that improves generation quality without additional computational cost.
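The core decomposition above can be illustrated on a toy problem. The sketch below (not the paper's implementation; the Gaussian "pretrained" model, the linear reward `r(x) = lam * x`, and the Langevin sampler are all illustrative assumptions) samples from a reward-weighted distribution p(x) ∝ p_pre(x)·exp(r(x)) by adding the reward gradient to the pretrained score, which is the essence of score guidance:

```python
import numpy as np

rng = np.random.default_rng(0)
LAM = 2.0  # hypothetical reward scale

def pretrained_score(x):
    # Score of the "pretrained" model, here a standard normal: grad log N(0,1) = -x.
    return -x

def reward_grad(x):
    # Gradient of a toy linear reward r(x) = LAM * x.
    return np.full_like(x, LAM)

def guided_langevin(n_steps=2000, step=0.01, n_samples=5000):
    # Langevin dynamics with the guided score: the stationary distribution
    # is the reward-weighted distribution p(x) ∝ p_pre(x) * exp(r(x)).
    x = rng.standard_normal(n_samples)
    for _ in range(n_steps):
        score = pretrained_score(x) + reward_grad(x)
        x = x + step * score + np.sqrt(2 * step) * rng.standard_normal(n_samples)
    return x

samples = guided_langevin()
print(samples.mean(), samples.std())
```

For this choice, the tilted distribution N(0,1)·exp(2x) is exactly N(2,1), so the sample mean lands near 2 and the standard deviation near 1. In the paper's setting the reward is only defined on clean data, so the guidance term at noise level t involves a conditional expectation of the reward given the noisy sample, which the proposed guidance network estimates.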