KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

📅 2026-05-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing reinforcement learning approaches struggle to align streaming autoregressive (AR) video generators with human preferences. They rely on noise-based exploration and stochastic differential equation (SDE)-based surrogate policies, which are mismatched to the deterministic ordinary differential equation (ODE) dynamics of distilled AR models and tend to perturb low-level appearance rather than the semantic storyline progression needed for long-horizon coherence. This work proposes KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework that moves the source of exploration from random noise to the historical key-value (KV) cache: stochastically routing historical KV entries produces semantically diverse video branches. KVPO further introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which yields a reward-weighted contrastive objective consistent with the native ODE formulation. Experiments on multiple distilled AR video generators show consistent gains in visual quality, motion quality, and text-video alignment in both single-prompt short-video and multi-prompt long-video settings.
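The "group relative" part of GRPO can be illustrated with a minimal sketch: rewards for a group of branches generated from the same prompt are normalized against the group's own mean and standard deviation, so no learned value function is needed. The function name, the `eps` constant, and the reward values below are illustrative assumptions, not taken from the paper, and the sketch does not cover KVPO's TVE-based objective.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of branches for the same prompt.

    Hypothetical helper: standard GRPO-style normalization, not KVPO-specific.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps avoids division by zero when all branches receive the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative rewards for four branches of one prompt (made-up numbers)
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Branches scoring above the group mean receive positive advantages and those below receive negative ones, which is what allows a contrastive, reward-weighted update across the group.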
📝 Abstract
Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.
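One way to picture exploration through the KV cache instead of through noise is the sketch below. It is only an assumption about what "stochastically routing historical KV entries" could look like: each historical entry is kept with some probability while the most recent entry is always retained. The abstract does not specify the routing rule, and the cache layout, `keep_prob`, `always_keep_last`, and both function names are hypothetical.

```python
import random


def route_kv_cache(kv_cache, keep_prob, rng, always_keep_last=1):
    """Return a randomly routed view of a historical KV cache.

    kv_cache: list of (key, value) entries, oldest first (hypothetical layout).
    Assumption: routing = keep each older entry with probability keep_prob.
    """
    n = len(kv_cache)
    routed = []
    for i, entry in enumerate(kv_cache):
        is_recent = i >= n - always_keep_last  # never drop the newest context
        if is_recent or rng.random() < keep_prob:
            routed.append(entry)
    return routed


def sample_branches(kv_cache, group_size, keep_prob=0.7, seed=0):
    """Build one routed cache per branch; each would condition one rollout."""
    rng = random.Random(seed)
    return [route_kv_cache(kv_cache, keep_prob, rng) for _ in range(group_size)]


# Toy cache with six placeholder entries standing in for per-chunk tensors
cache = [(f"k{i}", f"v{i}") for i in range(6)]
branches = sample_branches(cache, group_size=4)
```

In this sketch the initial noise can stay fixed across branches, so any diversity comes from which parts of the history each branch attends to. A deterministic ODE sampler would then decode each branch, and the resulting videos would be scored as one GRPO group.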
Problem

Research questions and friction points this paper is trying to address.

autoregressive video generation
human preference alignment
semantic coherence
ODE dynamics
long-horizon video
Innovation

Methods, ideas, or system contributions that make the work stand out.

KVPO
ODE-native
semantic exploration
velocity-field policy
autoregressive video alignment