🤖 AI Summary
To address unnatural inter-frame motion (e.g., distortion, reversal, freezing) and poor temporal alignment in text-to-video generation, this paper proposes a parameter-free inference-time optimization method. It introduces beam search in the diffusion latent space, integrating lookahead reward estimation over future denoising steps with a dynamic, prompt-aware, and calibratable alignment reward. Crucially, the work identifies and characterizes the misalignment between existing naturalness metrics and human perceptual judgments, then designs a reward-weighting scheme to improve evaluation consistency. Experiments demonstrate that the approach significantly outperforms greedy search and best-of-N sampling in perceptual video quality. Furthermore, the authors derive an optimal computational allocation strategy, balancing search budget, lookahead steps, and denoising steps, that enables efficient, high-fidelity video generation. This establishes a new paradigm for inference-time optimization in diffusion-based video synthesis.
📝 Abstract
The remarkable progress of text-to-video diffusion models enables photorealistic generation, yet the generated videos often contain unnatural movement or deformation, reverse playback, and motionless scenes. Recently, the alignment problem, in which the output of diffusion models is steered by some measure of content quality, has attracted considerable attention. Because there is large room for improving perceptual quality along the frame direction, we should address which metrics to optimize and how to optimize them in video generation. In this paper, we propose diffusion latent beam search with a lookahead estimator, which selects better diffusion latents to maximize a given alignment reward at inference time. We then point out that improving perceptual video quality while accounting for alignment to prompts requires reward calibration by weighting existing metrics. When outputs are evaluated with vision-language models as a proxy for human raters, many previous metrics for quantifying video naturalness do not always correlate with their judgments and also depend on the degree of dynamic description in the evaluation prompts. We demonstrate that our method improves perceptual quality under the calibrated reward, without updating model parameters, and yields better generations than greedy search and best-of-N sampling. We also provide practical guidelines on how to allocate inference-time computation in the reverse diffusion process among search budget, lookahead steps for reward estimation, and denoising steps.
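To make the search procedure concrete, here is a minimal, self-contained sketch of beam search over diffusion latents with a lookahead reward estimate. The scalar latents, the `denoise_step` drift, and the `reward` function are all toy stand-ins (a real system would use a video diffusion model's denoiser and a learned, calibrated alignment reward); only the control flow mirrors the method described above.

```python
import random

random.seed(0)

# Toy stand-ins: real systems would use a video diffusion denoiser
# and a calibrated alignment reward. Scalars keep the sketch runnable.
def denoise_step(latent, step, total_steps):
    """One stochastic reverse-diffusion step: drift toward 0 plus noise."""
    noise_scale = 1.0 - step / total_steps
    return latent * 0.9 + random.gauss(0.0, 0.1 * noise_scale)

def reward(latent):
    """Toy alignment reward on a denoised latent; higher is better."""
    return -abs(latent)

def lookahead_reward(latent, step, total_steps, lookahead):
    """Estimate the final reward by denoising a few extra steps ahead."""
    x = latent
    for s in range(step, min(step + lookahead, total_steps)):
        x = denoise_step(x, s, total_steps)
    return reward(x)

def latent_beam_search(beam_width=4, expansions=3, total_steps=10, lookahead=3):
    """Beam search over diffusion latents guided by lookahead rewards."""
    beams = [random.gauss(0.0, 1.0) for _ in range(beam_width)]
    for step in range(total_steps):
        # Expand each beam with several stochastic denoising continuations.
        candidates = [denoise_step(b, step, total_steps)
                      for b in beams for _ in range(expansions)]
        # Rank candidates by lookahead reward; keep the top beam_width.
        candidates.sort(
            key=lambda z: lookahead_reward(z, step + 1, total_steps, lookahead),
            reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=reward)

best = latent_beam_search()
```

The `beam_width`, `expansions`, `lookahead`, and `total_steps` knobs correspond to the axes discussed in the abstract (search budget, lookahead steps for reward estimation, and denoising steps), which is where the paper's compute-allocation guidelines apply.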