🤖 AI Summary
This work proposes Diffusion-DRF, a novel approach to preference optimization for video generation that overcomes the limitations of existing methods relying on non-differentiable human annotations or reward models, which often introduce bias, encourage reward hacking, and lead to unstable training. Diffusion-DRF enables end-to-end differentiable reward signal propagation by leveraging a frozen, off-the-shelf vision-language model as a training-free critic. It backpropagates logit-level feedback through the diffusion denoising chain to produce token-wise gradients for optimizing the video diffusion model. Requiring neither additional reward models nor preference data, the method effectively mitigates reward hacking and mode collapse while supporting multi-dimensional semantic alignment. Experiments demonstrate that Diffusion-DRF significantly improves visual quality and text-video alignment, enhances training stability, and exhibits model-agnostic generalizability across diverse diffusion-based generation tasks.
📝 Abstract
Direct Preference Optimization (DPO) has recently improved Text-to-Video (T2V) generation by enhancing visual fidelity and text alignment. However, current methods rely on non-differentiable preference signals from human annotations or learned reward models. This reliance makes training label-intensive, bias-prone, and easy to game, which often triggers reward hacking and unstable training. We propose Diffusion-DRF, a differentiable reward flow for fine-tuning video diffusion models using a frozen, off-the-shelf Vision-Language Model (VLM) as a training-free critic. Diffusion-DRF directly backpropagates VLM feedback through the diffusion denoising chain, converting logit-level responses into token-aware gradients for optimization. We propose an automated, aspect-structured prompting pipeline to obtain reliable multi-dimensional VLM feedback, while gradient checkpointing enables efficient updates through the final denoising steps. Diffusion-DRF improves video quality and semantic alignment while mitigating reward hacking and mode collapse -- without additional reward models or preference datasets. It is model-agnostic and readily generalizes to other diffusion-based generative tasks.
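The core mechanism -- backpropagating a frozen critic's logit-level feedback through the last few denoising steps, with gradient checkpointing to keep memory manageable -- can be illustrated with a minimal PyTorch sketch. This is our own toy stand-in, not the paper's implementation: `denoiser` is a trivial surrogate for one denoising step of the video diffusion model, and `critic` is a frozen linear layer standing in for the VLM's logit head.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Hypothetical stand-ins (names are ours, not from the paper):
denoiser = nn.Linear(8, 8)          # trainable surrogate for one denoising step
critic = nn.Linear(8, 2)            # frozen surrogate for the VLM logit head
for p in critic.parameters():
    p.requires_grad_(False)         # training-free critic: no critic updates

x = torch.randn(4, 8)               # batch of noisy latents
K = 3                               # backprop only through the final K denoising steps
for _ in range(K):
    # gradient checkpointing: activations are recomputed in the backward pass
    # instead of stored, trading compute for memory across the denoising chain
    x = checkpoint(lambda inp: torch.tanh(denoiser(inp)), x, use_reentrant=False)

logits = critic(x)                  # logit-level feedback from the frozen critic
reward = logits[:, 1].mean()        # e.g. treat one logit as the reward signal
(-reward).backward()                # maximize reward end-to-end through the chain

print(denoiser.weight.grad is not None)  # generator parameters receive gradients
print(critic.weight.grad is None)        # critic remains untouched
```

The key property this demonstrates is that the reward is differentiable end-to-end: gradients reach the generator's parameters through the denoising steps while the frozen critic contributes only a signal, never receiving updates itself.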