LatSearch: Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently searching for high-quality initial noise in video diffusion models during inference, where reward signals are typically delayed, sparse, and computationally expensive. To overcome this limitation, the authors propose a latent reward-guided mechanism that enables effective evaluation and optimization of candidate samples at any denoising timestep. The proposed method, LatSearch, performs Reward-Guided Resampling and Pruning (RGRP): it integrates a latent reward model, a resampling strategy based on normalized reward probabilities, and a cumulative-reward pruning technique. Experimental results demonstrate that LatSearch significantly enhances both inference efficiency and generation quality, consistently outperforming the Wan2.1 baseline across multiple dimensions—including visual fidelity, motion dynamics, and text alignment—on the VBench-2.0 benchmark.

📝 Abstract
The recent success of inference-time scaling in large language models has inspired similar explorations in video diffusion. In particular, motivated by the existence of "golden noise" that enhances video quality, prior work has attempted to improve inference by optimising or searching for better initial noise. However, these approaches have notable limitations: they either rely on priors imposed at the beginning of noise sampling or on rewards evaluated only on the denoised and decoded videos. This leads to error accumulation, delayed and sparse reward signals, and prohibitive computational cost, which prevents the use of stronger search algorithms. Crucially, stronger search algorithms are precisely what could unlock substantial gains in controllability, sample efficiency and generation quality for video diffusion, provided their computational cost can be reduced. To fill this gap, we enable efficient inference-time scaling for video diffusion through latent reward guidance, which provides intermediate, informative and efficient feedback along the denoising trajectory. We introduce a latent reward model that scores partially denoised latents at arbitrary timesteps with respect to visual quality, motion quality, and text alignment. Building on this model, we propose LatSearch, a novel inference-time search mechanism that performs Reward-Guided Resampling and Pruning (RGRP). In the resampling stage, candidates are sampled according to reward-normalised probabilities to reduce over-reliance on the reward model. In the pruning stage, applied at the final scheduled step, only the candidate with the highest cumulative reward is retained, improving both quality and efficiency. We evaluate LatSearch on the VBench-2.0 benchmark and demonstrate that it consistently improves video generation across multiple evaluation dimensions compared to the baseline Wan2.1 model.
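The two RGRP stages described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the softmax form of the reward-normalised distribution, and the `temperature` parameter are illustrative assumptions, and the latents here are plain arrays standing in for partially denoised video latents.

```python
import numpy as np

def rgrp_resample(candidates, rewards, rng, temperature=1.0):
    """Resampling stage (sketch): redraw the candidate pool with
    probability proportional to softmax-normalised rewards, so
    sampling stays stochastic and does not over-trust the reward
    model. `temperature` is an assumed knob, not from the paper."""
    rewards = np.asarray(rewards, dtype=float)
    logits = rewards / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    idx = rng.choice(len(candidates), size=len(candidates), p=probs)
    return [candidates[i] for i in idx]

def rgrp_prune(candidates, cumulative_rewards):
    """Pruning stage (sketch): at the final scheduled step, keep only
    the candidate whose cumulative reward along its denoising
    trajectory is highest."""
    best = int(np.argmax(cumulative_rewards))
    return candidates[best]

# Toy usage: three stand-in latents with per-step reward scores.
rng = np.random.default_rng(0)
latents = [np.zeros(4), np.ones(4), np.full(4, 2.0)]
step_rewards = [0.1, 2.0, 0.5]
pool = rgrp_resample(latents, step_rewards, rng)
survivor = rgrp_prune(latents, step_rewards)
```

In a full pipeline, the resampling step would run at several intermediate denoising timesteps using the latent reward model's scores, with the single pruning step reserved for the final scheduled timestep.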
Problem

Research questions and friction points this paper is trying to address.

inference-time scaling
video diffusion
reward signal
computational cost
noise optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent reward guidance
inference-time scaling
video diffusion
reward-guided search
RGRP