Planning with Sketch-Guided Verification for Physics-Aware Video Generation

📅 2025-11-21
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing video generation methods face a dilemma in motion planning: single-step planning lacks the expressivity for complex physical motions, while iterative optimization incurs prohibitive computational overhead. This paper introduces SketchVerify, a training-free, lightweight, sketch-guided verification framework that jointly ensures semantic consistency and physical plausibility via a test-time sampling-and-verification loop: "multi-candidate trajectory generation → sketch rendering → vision-language verifier-based ranking → trajectory selection and refinement." Its core innovation lies in decoupling motion planning from video synthesis: lightweight sketches and a frozen vision-language model enable efficient trajectory-quality assessment without repeated diffusion-model invocations. Evaluated on WorldModelBench and PhyWorldBench, SketchVerify significantly improves motion coherence, physical realism, and long-horizon consistency while reducing computational cost by up to an order of magnitude.
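The sampling-and-verification loop described above can be sketched in code. This is a minimal toy illustration, not the paper's implementation: `propose_trajectory`, `render_sketch`, and `verifier_score` are hypothetical stand-ins for the trajectory planner, the composited video sketch, and the frozen vision-language verifier, and the jerk-based score is only a crude proxy for physical plausibility.

```python
import random

def propose_trajectory(prompt, n_frames, rng):
    # Hypothetical stand-in for the planner: sample one candidate 2D object
    # trajectory (one (x, y) point per frame) with random drift.
    x, y, traj = 0.0, 0.0, []
    for _ in range(n_frames):
        x += rng.uniform(0.5, 1.5)   # forward motion
        y += rng.uniform(-0.3, 0.3)  # lateral jitter
        traj.append((x, y))
    return traj

def render_sketch(traj):
    # Stand-in for sketch rendering: composite the moving object over a
    # static background; here each "frame" is just a position record.
    return [{"object_xy": p, "background": "static"} for p in traj]

def verifier_score(sketch, prompt):
    # Stand-in for the frozen vision-language verifier. As a toy proxy for
    # physical plausibility, penalize frame-to-frame jerk: smoother is better.
    xs = [f["object_xy"][0] for f in sketch]
    ys = [f["object_xy"][1] for f in sketch]
    jerk = 0.0
    for i in range(2, len(sketch)):
        ax = xs[i] - 2 * xs[i - 1] + xs[i - 2]
        ay = ys[i] - 2 * ys[i - 1] + ys[i - 2]
        jerk += ax * ax + ay * ay
    return -jerk

def plan_with_verification(prompt, n_candidates=8, n_frames=16, rounds=3,
                           threshold=-0.5, seed=0):
    # Test-time loop: sample candidates, score their sketches, keep the best,
    # and stop early once a satisfactory plan is found. The winning trajectory
    # would then condition the (expensive) video generator exactly once.
    rng = random.Random(seed)
    best_traj, best_score = None, float("-inf")
    for _ in range(rounds):
        for _ in range(n_candidates):
            traj = propose_trajectory(prompt, n_frames, rng)
            score = verifier_score(render_sketch(traj), prompt)
            if score > best_score:
                best_traj, best_score = traj, score
        if best_score >= threshold:
            break
    return best_traj, best_score
```

The key efficiency point mirrored here is that only cheap sketches enter the scoring loop; the diffusion generator is invoked once, with the selected plan, rather than once per candidate.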

๐Ÿ“ Abstract
Recent video generation approaches increasingly rely on planning intermediate control signals, such as object trajectories, to improve temporal coherence and motion fidelity. However, these methods mostly employ either single-shot plans, which are typically limited to simple motions, or iterative refinement, which requires multiple calls to the video generator and incurs high computational cost. To overcome these limitations, we propose SketchVerify, a training-free, sketch-verification-based planning framework that improves motion planning quality with more dynamically coherent trajectories (i.e., physically plausible and instruction-consistent motions) prior to full video generation by introducing a test-time sampling and verification loop. Given a prompt and a reference image, our method predicts multiple candidate motion plans and ranks them using a vision-language verifier that jointly evaluates semantic alignment with the instruction and physical plausibility. To efficiently score candidate motion plans, we render each trajectory as a lightweight video sketch by compositing objects over a static background, which bypasses the need for expensive, repeated diffusion-based synthesis while achieving comparable performance. We iteratively refine the motion plan until a satisfactory one is identified, which is then passed to the trajectory-conditioned generator for final synthesis. Experiments on WorldModelBench and PhyWorldBench demonstrate that our method significantly improves motion quality, physical realism, and long-term consistency compared to competitive baselines while being substantially more efficient. Our ablation study further shows that scaling up the number of trajectory candidates consistently enhances overall performance.
Problem

Research questions and friction points this paper is trying to address.

Improving motion planning quality for physics-aware video generation
Overcoming limitations of single-shot plans and iterative refinement methods
Achieving physically plausible trajectories before full video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

SketchVerify framework uses test-time sampling and verification
Renders trajectories as lightweight sketches for efficient scoring
Iteratively refines motion plans before final video synthesis