🤖 AI Summary
This work addresses the challenge of aligning streaming autoregressive video generation models with human preferences under low-stochasticity trajectories, where conventional reinforcement learning from human feedback struggles. To this end, we propose AR-CoPO, the first framework to adapt the contrastive policy optimization perspective of Neighbor GRPO to this setting. AR-CoPO introduces chunk-level alignment and a trajectory-forking mechanism to construct local policy neighborhoods, combined with semi-on-policy updates and a replay buffer to enhance both exploration efficiency and training stability. Evaluated against the Self-Forcing baseline, AR-CoPO demonstrates substantial improvements in out-of-domain generalization and in-domain alignment with human preferences, indicating genuine alignment rather than reward hacking and effectively addressing the alignment difficulty inherent in low-stochasticity generation trajectories.
📝 Abstract
Streaming autoregressive (AR) video generators combined with few-step distillation achieve low-latency, high-quality synthesis, yet remain difficult to align via reinforcement learning from human feedback (RLHF). Existing SDE-based GRPO methods face challenges in this setting: few-step ODEs and consistency model samplers deviate from standard flow-matching ODEs, and their short, low-stochasticity trajectories are highly sensitive to initialization noise, rendering intermediate SDE exploration ineffective. We propose AR-CoPO (AutoRegressive Contrastive Policy Optimization), a framework that adapts the Neighbor GRPO contrastive perspective to streaming AR generation. AR-CoPO introduces chunk-level alignment via a forking mechanism that constructs neighborhood candidates at a randomly selected chunk, assigns sequence-level rewards, and performs localized GRPO updates. We further propose a semi-on-policy training strategy that complements on-policy exploration with exploitation over a replay buffer of reference rollouts, improving generation quality across domains. Experiments on Self-Forcing demonstrate that AR-CoPO improves both out-of-domain generalization and in-domain human preference alignment over the baseline, providing evidence of genuine alignment rather than reward hacking.
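The chunk-level forking and localized GRPO update described above can be sketched as follows. This is a minimal illustrative outline, not the authors' implementation: all names (`grpo_advantages`, `chunk_level_update`, `sample_continuation`, `reward_model`, `group_size`) are hypothetical, and the actual loss, sampler, and reward model are omitted.

```python
# Hypothetical sketch of AR-CoPO's chunk-level contrastive GRPO step:
# fork a rollout at a random chunk, sample a neighborhood of candidate
# continuations, score them with a sequence-level reward, and convert
# rewards into group-relative advantages for a localized policy update.
import random

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize rewards within the fork group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def chunk_level_update(rollout_chunks, sample_continuation, reward_model,
                       group_size=4):
    """Fork at a randomly selected chunk and score the candidate group.

    `rollout_chunks`      : list of generated chunks (e.g. frame groups)
    `sample_continuation` : draws one candidate continuation given a prefix
    `reward_model`        : assigns a sequence-level scalar reward
    Returns the fork index, candidates, and their advantages; the
    gradient step over the forked chunk's tokens is intentionally omitted.
    """
    fork_idx = random.randrange(len(rollout_chunks))   # random fork point
    prefix = rollout_chunks[:fork_idx]                 # shared frozen prefix
    candidates = [sample_continuation(prefix) for _ in range(group_size)]
    rewards = [reward_model(prefix + c) for c in candidates]
    return fork_idx, candidates, grpo_advantages(rewards)
```

Under this sketch, candidates in a fork group compete only against each other, so the advantages are centered within the group, which is the group-relative normalization characteristic of GRPO-style objectives.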