🤖 AI Summary
To address insufficient temporal coherence in image-to-video (I2V) generation, this paper introduces Video Consistency Distance (VCD), a novel temporal consistency metric grounded in frequency-domain analysis, and integrates it into a reward-driven fine-tuning framework. Unlike existing reward functions that prioritize holistic attributes such as aesthetic quality or static fidelity, VCD explicitly measures the feature-space distance between the input conditioning image and the generated frame sequence in the frequency domain, thereby optimizing temporal coherence anchored to the input image. Crucially, VCD requires no ground-truth video supervision, enabling efficient training by pairing a video diffusion model with a feature-level frequency-domain distance as the reward signal. Extensive experiments on multiple I2V benchmarks demonstrate substantial improvements in temporal consistency metrics, including reduced Temporal Video Distance (TVD) and Fréchet Video Distance (FVD), while preserving or even enhancing visual quality; the proposed method achieves state-of-the-art performance in comprehensive evaluations.
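As a rough illustration of the idea (not the paper's actual formulation), the sketch below computes a frequency-domain distance between the conditioning image's features and each generated frame's features. Every specific choice in it is an assumption: the feature shapes, the 2D spatial FFT, the squared-magnitude distance, and the per-frame averaging.

```python
# Minimal sketch of a frequency-domain consistency distance in the
# spirit of VCD. All specifics here are illustrative assumptions and
# are not taken from the paper.
import torch


def video_consistency_distance(cond_feat: torch.Tensor,
                               frame_feats: torch.Tensor) -> torch.Tensor:
    """Distance, in frequency space, between a conditioning image's
    features and the features of each generated frame.

    cond_feat:   (C, H, W)    features of the conditioning image
    frame_feats: (T, C, H, W) features of the T generated frames
    """
    # Move both feature maps into the frequency domain (assumed: a 2D
    # FFT over the spatial axes of the feature maps).
    cond_freq = torch.fft.fft2(cond_feat)      # (C, H, W), complex
    frame_freq = torch.fft.fft2(frame_feats)   # (T, C, H, W), complex

    # Squared magnitude of the complex difference per frame, averaged
    # over channels and frequency bins; broadcast the condition over T.
    diff = frame_freq - cond_freq.unsqueeze(0)
    per_frame = diff.abs().pow(2).mean(dim=(1, 2, 3))  # (T,)

    # Lower VCD = generated frames stay closer, in frequency space,
    # to the conditioning image.
    return per_frame.mean()
```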
📝 Abstract
Reward-based fine-tuning of video diffusion models is an effective approach to improving the quality of generated videos, as it allows models to be fine-tuned without real-world video datasets. However, its benefits are often confined to specific aspects of quality, because conventional reward functions mainly target properties of the generated video sequence as a whole, such as aesthetic appeal and overall consistency. Notably, the temporal consistency of the generated video often suffers when previous approaches are applied to image-to-video (I2V) generation tasks. To address this limitation, we propose Video Consistency Distance (VCD), a novel metric designed to enhance temporal consistency, and fine-tune a model with it in a reward-based fine-tuning framework. To achieve coherent temporal consistency relative to a conditioning image, VCD is defined in the frequency space of video frame features, capturing frame information effectively through frequency-domain analysis. Experimental results across multiple I2V datasets demonstrate that fine-tuning a video generation model with VCD significantly enhances temporal consistency without degrading other performance metrics, compared to previous methods.
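To show where such a metric could sit in a reward-based fine-tuning loop, here is a hedged sketch that reuses `video_consistency_distance` from the block above. The `i2v_model` and `feature_extractor` names are hypothetical placeholders, and backpropagating the reward directly through the generated frames is one possible design, not necessarily the paper's framework.

```python
# Hedged sketch of one reward-based fine-tuning step built on the VCD
# sketch above. `i2v_model` and `feature_extractor` are hypothetical
# placeholders; how gradients actually reach the diffusion model in
# the paper may differ.
import torch


def fine_tune_step(i2v_model, feature_extractor, cond_image, optimizer):
    # Generate a differentiable clip from the conditioning image.
    frames = i2v_model(cond_image)              # (T, 3, H, W)

    # Extract condition-image and per-frame features with the same
    # (frozen) feature extractor.
    cond_feat = feature_extractor(cond_image)   # (C, H, W)
    frame_feats = feature_extractor(frames)     # (T, C, H, W)

    # Reward = negative VCD: minimizing the distance encourages
    # temporal consistency anchored to the conditioning image, with
    # no ground-truth video required.
    loss = video_consistency_distance(cond_feat, frame_feats)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```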