Harness Local Rewards for Global Benefits: Effective Text-to-Video Generation Alignment with Patch-level Reward Models

πŸ“… 2025-02-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing text-to-video generation models (VGMs) optimize for global video quality while neglecting local spatiotemporal distortions, leading to frequent semantic and temporal inconsistencies. To address this, we propose HALO, a post-training alignment framework featuring a novel dual-scale (patch-level and video-level) reward coordination mechanism. We introduce a multi-scale reward model distilled from GPT-4o to provide fine-grained, interpretable feedback, and design Gran-DPOβ€”the first direct preference optimization algorithm adapted to diffusion models at the patch level. Experiments show that our patch-level reward achieves high agreement with human annotations (Spearman ρ > 0.89). HALO consistently outperforms state-of-the-art methods on both two-stage and end-to-end evaluation benchmarks, and quantitatively mitigates over 72% of localized semantic and temporal defects.

πŸ“ Abstract
The emergence of diffusion models (DMs) has significantly improved the quality of text-to-video generation models (VGMs). However, current VGM optimization primarily emphasizes the global quality of videos while overlooking localized errors, which leads to suboptimal generation capabilities. To address this issue, we propose HALO, a post-training strategy for VGMs that explicitly incorporates local feedback from a patch reward model, which, together with the video reward model, provides detailed and comprehensive training signals for advanced VGM optimization. To develop an effective patch reward model, we distill GPT-4o to continually train our video reward model, which improves training efficiency and ensures consistency between the video and patch reward distributions. Furthermore, to integrate patch rewards harmoniously into VGM optimization, we introduce a granular DPO (Gran-DPO) algorithm for DMs, allowing patch and video rewards to be used collaboratively during optimization. Experimental results indicate that our patch reward model aligns well with human annotations and that HALO substantially outperforms the baselines across two evaluation methods. Further experiments quantitatively confirm the existence of patch defects, which our proposed method effectively alleviates.
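The abstract describes Gran-DPO as using patch-level and video-level rewards collaboratively within a DPO-style objective, but does not give its formulas. The sketch below is only illustrative: it blends a global video reward with averaged patch rewards via a simple weighted average and plugs the resulting preference pair into the standard DPO loss. The blending weight `alpha`, the averaging scheme, and `beta` are assumptions, not the paper's actual Gran-DPO formulation.

```python
import math

def combined_reward(video_reward, patch_rewards, alpha=0.5):
    """Dual-scale reward: blend the global video score with the mean
    patch score. The weighted-average form and alpha are illustrative
    assumptions, not the paper's coordination mechanism."""
    patch_mean = sum(patch_rewards) / len(patch_rewards)
    return alpha * video_reward + (1 - alpha) * patch_mean

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO preference loss -log(sigmoid(margin)) for one
    (winner, loser) pair, where the pair is chosen by combined_reward.
    Gran-DPO's granular, patch-aware variant is not reproduced here."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example: pick the preferred video by dual-scale reward, then score the pair.
r_a = combined_reward(1.0, [0.0, 1.0])   # 0.5*1.0 + 0.5*0.5 = 0.75
r_b = combined_reward(0.6, [0.2, 0.2])   # 0.5*0.6 + 0.5*0.2 = 0.40
loss = dpo_loss(logp_win=-1.0, logp_lose=-2.0,
                ref_logp_win=-1.5, ref_logp_lose=-1.5)
```

With zero margin the loss reduces to log 2, the usual DPO sanity check; in practice one would use a numerically stable log-sigmoid rather than the explicit `exp` shown here.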
Problem

Research questions and friction points this paper is trying to address.

Improve text-to-video generation quality
Address localized errors in video generation
Integrate patch-level rewards for optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Patch-level reward models
GPT-4o distillation
Granular DPO algorithm