VAMPO: Policy Optimization for Improving Visual Dynamics in Video Action Models

📅 2026-03-19
🤖 AI Summary
Existing diffusion-based video prediction models struggle to accurately capture the visual dynamics required for robotic manipulation due to their reliance on likelihood-surrogate objectives, often introducing subtle errors in object pose, spatial relationships, and contact timing. To address this, we propose VAMPO, a framework that formulates the multi-step denoising process as a sequential decision-making problem and directly optimizes the denoising policy in latent space using expert-defined, non-adversarial rewards. We introduce the Euler Hybrid sampler, which injects controllable stochasticity at the initial denoising step to enable low-variance policy gradient estimation while preserving trajectory coherence. Integrated with the GRPO algorithm, our approach significantly improves the accuracy of task-relevant visual dynamics in both simulated and real-world manipulation tasks, thereby enhancing downstream action generation performance and generalization.
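The summary's central mechanism is the Euler Hybrid sampler: stochasticity is injected only at the first denoising step, so the policy-gradient estimator has a single stochastic action per trajectory while the remaining steps stay deterministic. The paper does not give code; the sketch below is a hypothetical illustration of that sampling pattern (the function names `euler_hybrid_sample` and `velocity_fn`, and the `noise_scale` parameter, are assumptions, not the authors' interface).

```python
import numpy as np

def euler_hybrid_sample(velocity_fn, x_T, num_steps, noise_scale=0.1, rng=None):
    """Sketch of an Euler Hybrid-style sampler (hypothetical interface).

    Noise is injected only after the first Euler update, so the sampled
    trajectory exposes exactly one stochastic action to the policy
    gradient; all later steps follow the deterministic Euler ODE update.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x_T, dtype=np.float64)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt              # integrate from t=1 (noise) to t=0 (data)
        v = velocity_fn(x, t)         # model-predicted velocity/score direction
        x = x - dt * v                # deterministic Euler update
        if i == 0:
            # Stochasticity only at the first step: one Gaussian perturbation,
            # hence a single tractable log-probability term for the gradient.
            x = x + noise_scale * rng.standard_normal(x.shape)
    return x
```

With `noise_scale=0` the sampler reduces to plain deterministic Euler integration, which is one way to see why the remaining trajectory's coherence is preserved.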

📝 Abstract
Video action models are an appealing foundation for Vision-Language-Action systems because they can learn visual dynamics from large-scale video data and transfer this knowledge to downstream robot control. Yet current diffusion-based video predictors are trained with likelihood-surrogate objectives, which encourage globally plausible predictions without explicitly optimizing the precision-critical visual dynamics needed for manipulation. This objective mismatch often leads to subtle errors in object pose, spatial relations, and contact timing that can be amplified by downstream policies. We propose VAMPO, a post-training framework that directly improves visual dynamics in video action models through policy optimization. Our key idea is to formulate multi-step denoising as a sequential decision process and optimize the denoising policy with rewards defined over expert visual dynamics in latent space. To make this optimization practical, we introduce an Euler Hybrid sampler that injects stochasticity only at the first denoising step, enabling tractable low-variance policy-gradient estimation while preserving the coherence of the remaining denoising trajectory. We further combine this design with GRPO and a verifiable non-adversarial reward. Across diverse simulated and real-world manipulation tasks, VAMPO improves task-relevant visual dynamics, leading to better downstream action generation and stronger generalization. The homepage is https://vampo-robot.github.io/VAMPO/.
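The abstract pairs the sampler with GRPO, whose defining trait is computing advantages relative to a group of rollouts rather than a learned value function. As a minimal sketch of that group-relative step (the function name and the per-group standardization details are assumptions; the paper's exact reward and normalization are not specified here):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage estimate (sketch).

    Each rollout's reward is standardized against the mean and standard
    deviation of its own group of sampled rollouts, so no separate
    value network is needed to reduce gradient variance.
    """
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```

In VAMPO's setting, each group would consist of several denoised video predictions for the same conditioning input, scored by the verifiable non-adversarial reward over latent visual dynamics.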
Problem

Research questions and friction points this paper is trying to address.

visual dynamics
video action models
policy optimization
diffusion models
robot manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

policy optimization
visual dynamics
video action models
denoising policy
Euler Hybrid sampler