$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses camera control for pretrained video generation models without additional training, where the core difficulty is balancing trajectory adherence against visual quality under partial observation. The authors propose a structural change to the sampler: each hard-replacement guidance step is followed by a block-conditional pseudo-Gibbs refinement of the unobserved regions at the same noise level. To make this efficient in high-dimensional video latents, the unobserved regions are partitioned into local 3D patches, and converged patches are adaptively frozen; the observed sites come from a depth-warped guidance video. On RealEstate10K and DAVIS, the method attains the best FVD among all seven training-free and training-based baselines and outperforms every training-free baseline on every reported metric.

📝 Abstract

Training-free camera control for pretrained flow-matching video generators is a partial-observation inverse problem: a depth-warped guidance video supplies noisy evidence on a subset of latent sites, which the sampler must reconcile with the pretrained prior. Existing methods struggle to balance trajectory adherence against visual quality, and their heuristic guidance-strength tuning lacks robustness. We propose $h$-control, which resolves this dilemma through a structural change to the sampler: each outer hard-replacement guidance step is augmented with an inner-loop block-conditional pseudo-Gibbs refinement on the unobserved complement at the same noise level, with provable convergence to the partial-observation conditional data law. To accelerate convergence on high-dimensional video latents, we exploit their conditional locality by partitioning the unobserved complement into 3D patches, each tracked by a custom mixing indicator that adaptively freezes converged patches. On RealEstate10K and DAVIS, $h$-control attains the best FVD against all seven training-free and training-based competitors and outperforms every training-free baseline on every reported metric.
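The control flow described in the abstract (hard replacement on observed sites, then patch-wise inner updates on the unobserved complement with per-patch freezing) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: `toy_prior_update` is a hypothetical deterministic neighbour-averaging stand-in for the stochastic conditional update driven by a pretrained flow-matching model, the max-change threshold `tol` is a hypothetical stand-in for the paper's custom mixing indicator, and the patch size and sweep count are arbitrary.

```python
import numpy as np

def toy_prior_update(x):
    """Hypothetical stand-in for the pretrained prior: average every site
    with its two neighbours along each axis (periodic boundaries)."""
    total = x.copy()
    for axis in range(x.ndim):
        total += np.roll(x, 1, axis=axis) + np.roll(x, -1, axis=axis)
    return total / (2 * x.ndim + 1)

def refine_step(x, observed, guidance, patch=(2, 4, 4), sweeps=30, tol=1e-3):
    """One outer guidance step on a (T, H, W) latent: hard replacement,
    then inner patch-wise refinement of the unobserved sites."""
    x = x.copy()
    x[observed] = guidance[observed]  # hard replacement on observed sites

    # Partition the latent into 3D patches (slices clip at the borders).
    T, H, W = x.shape
    pt, ph, pw = patch
    blocks = [(slice(t, t + pt), slice(h, h + ph), slice(w, w + pw))
              for t in range(0, T, pt)
              for h in range(0, H, ph)
              for w in range(0, W, pw)]
    frozen = np.zeros(len(blocks), dtype=bool)

    for _ in range(sweeps):
        if frozen.all():
            break
        for i, blk in enumerate(blocks):
            if frozen[i]:
                continue
            free = ~observed[blk]  # unobserved sites inside this patch
            if not free.any():
                frozen[i] = True   # nothing to refine in this patch
                continue
            # Update this patch conditional on the current state of the rest.
            proposal = toy_prior_update(x)[blk]
            view = x[blk]          # basic slicing returns a view into x
            change = np.abs(proposal[free] - view[free]).max()
            view[free] = proposal[free]
            # Hypothetical convergence test standing in for the mixing indicator.
            frozen[i] = change < tol
    return x, int(frozen.sum())

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))
guidance = np.ones_like(latent)
observed = rng.random(latent.shape) < 0.4
refined, n_frozen = refine_step(latent, observed, guidance)
```

In this sketch the observed sites are never modified after the replacement, so adherence to the guidance is fixed by construction and only the unobserved patches are updated. Skipping frozen patches is what saves work in later sweeps; a real sampler would replace the averaging update with a conditional draw from the pretrained model at the current noise level.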

Problem

Research questions and friction points this paper is trying to address.

camera control

partial-observation inverse problem

flow-matching video generators

trajectory adherence

visual quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free camera control

block-conditional Gibbs refinement

partial-observation inverse problem