Consistency-Preserving Diverse Video Generation

📅 2026-02-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the low-sample regime of text-to-video generation (only a few videos are typically produced per prompt) and the trade-off between cross-video diversity and within-video temporal consistency. The authors propose a joint-sampling framework that applies diversity-driven updates in the latent space during flow-matching sampling while explicitly removing the update components that would harm temporal coherence. Both the diversity and temporal-consistency objectives are computed with lightweight latent-space models, so the method requires no image-space gradients and no backpropagation through the video decoder, substantially reducing computational overhead relative to prior joint-sampling approaches. Experiments on a state-of-the-art flow-matching text-to-video model show diversity comparable to strong baselines while significantly improving temporal consistency and color naturalness.
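The summary describes removing only the components of a diversity update that would hurt temporal coherence. The paper's exact update rule is not given here, but a minimal sketch of what such a conflict-removal step could look like is below, assuming a PCGrad-style gradient projection in latent space; `consistency_preserving_update`, the gradient names, and the step size are illustrative, not the authors' actual algorithm.

```python
import torch

def consistency_preserving_update(z, div_grad, tc_grad, step_size=0.1):
    """Hypothetical conflict-removal step: apply a diversity-driven update
    to the batch of latents z, but first project out the component of that
    update which would decrease a temporal-consistency objective.

    div_grad: ascent direction of a batch-diversity objective w.r.t. z
    tc_grad:  ascent direction of a temporal-consistency objective w.r.t. z
    """
    # Flatten per-sample gradients so the projection is a dot product per sample.
    d = div_grad.flatten(1)                     # (B, D)
    c = tc_grad.flatten(1)                      # (B, D)
    # Coefficient of the diversity update along the consistency gradient.
    coef = (d * c).sum(dim=1, keepdim=True) / (c.norm(dim=1, keepdim=True) ** 2 + 1e-8)
    # Remove that component only where it opposes consistency (coef < 0),
    # leaving updates that already agree with consistency untouched.
    d = d - torch.clamp(coef, max=0.0) * c
    return z + step_size * d.view_as(z)
```

In a flow-matching sampler, a step like this could be interleaved with the ODE integration steps, which would match the summary's description of diversity updates applied "during flow matching."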

📝 Abstract
Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity comparable to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Code will be released.
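The abstract's key efficiency claim is that both objectives are computed by lightweight models directly in latent space, so no video decoding is needed. The paper's actual objective models are not specified here; the sketch below shows plausible stand-ins under that assumption: pairwise latent distances for batch diversity and latent frame-to-frame smoothness for temporal consistency. Both function names and the latent layout `(B, T, C, H, W)` are hypothetical.

```python
import torch

def batch_diversity(latents):
    """Illustrative diversity objective: mean pairwise L2 distance between
    the latent videos in a batch of shape (B, T, C, H, W); larger = more diverse."""
    flat = latents.flatten(1)                      # (B, T*C*H*W)
    dists = torch.cdist(flat, flat, p=2)           # (B, B) pairwise distances
    B = flat.shape[0]
    return dists.sum() / (B * (B - 1))             # mean over ordered pairs

def temporal_consistency(latents):
    """Illustrative consistency objective: negative mean squared difference
    between consecutive latent frames; larger = temporally smoother."""
    frame_diff = latents[:, 1:] - latents[:, :-1]  # (B, T-1, C, H, W)
    return -frame_diff.pow(2).mean()
```

Because both functions operate on latents, their gradients are available by autograd without ever invoking the video decoder, which is the cost saving the abstract emphasizes.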
Problem

Research questions and friction points this paper is trying to address.

text-to-video generation
diversity
temporal consistency
low-sample regime
video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

flow-matching
temporal consistency
diverse video generation
latent-space optimization
joint sampling