Growing with the Generator: Self-paced GRPO for Video Generation

📅 2025-11-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing GRPO pipelines for post-training video generation models rely on static reward models. Their evaluation capability is fixed, which worsens distribution shift and makes rewards saturate quickly, undermining the stability and effectiveness of alignment. To address this, we propose a capability-aware self-paced GRPO framework in which the generator and the reward feedback co-evolve. First, we introduce a dynamic grouping evaluation mechanism that makes rewards more discriminative. Second, we design a progressive reward schedule that automatically shifts the reward focus from visual fidelity to temporal coherence and fine-grained semantic alignment as generated video quality improves, which amounts to an adaptive curriculum. Experiments on VBench across multiple backbones show improvements in both visual quality and temporal-semantic consistency over static-reward GRPO baselines, supporting the method's effectiveness and its generality across architectures.

📝 Abstract

Group Relative Policy Optimization (GRPO) has emerged as a powerful reinforcement learning paradigm for post-training video generation models. However, existing GRPO pipelines rely on static, fixed-capacity reward models whose evaluation behavior is frozen during training. Such rigid rewards introduce distributional bias, saturate quickly as the generator improves, and ultimately limit the stability and effectiveness of reinforcement-based alignment. We propose Self-Paced GRPO, a competence-aware GRPO framework in which reward feedback co-evolves with the generator. Our method introduces a progressive reward mechanism that automatically shifts its emphasis from coarse visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases. This self-paced curriculum alleviates reward-policy mismatch, mitigates reward exploitation, and yields more stable optimization. Experiments on VBench across multiple video generation backbones demonstrate consistent improvements in both visual quality and semantic alignment over GRPO baselines with static rewards, validating the effectiveness and generality of Self-Paced GRPO.
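To make the saturation problem concrete, here is a minimal sketch of the group-relative advantage that GRPO-style methods compute: each sample's reward is normalized against the other samples drawn for the same prompt. The function name, the `eps` value, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (one group per prompt).

    Illustrative sketch: `eps` and the population std are assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group in which the reward model still separates the samples.
spread = group_relative_advantages([0.2, 0.5, 0.9])

# A saturated group: the reward model scores every sample the same,
# so every advantage is zero and the policy gets no learning signal.
saturated = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When a frozen reward model stops distinguishing between the generator's samples, the group standard deviation collapses and the advantages vanish. That is one way to read the abstract's claim that static rewards saturate as the generator improves.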

Problem

Research questions and friction points this paper is trying to address.

Static reward models introduce distributional bias in video generation training

Fixed-capacity rewards saturate quickly as generator quality improves

Rigid reward mechanisms limit stability of reinforcement-based alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive reward mechanism co-evolves with generator

Shifts reward emphasis from visual fidelity to temporal coherence and fine-grained text-video semantic alignment as generation quality increases

Self-paced curriculum mitigates reward-policy mismatch
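The points above can be illustrated with a small sketch of a competence-dependent reward mix. The source does not give the paper's actual schedule, so everything here is hypothetical: the scalar `competence` in [0, 1], the three reward names, and the piecewise-linear weighting are assumptions chosen only to show the idea of emphasis moving from visual fidelity to temporal coherence and then to semantic alignment.

```python
def self_paced_weights(competence):
    """Map a competence estimate in [0, 1] to reward weights.

    Hypothetical schedule, not the paper's: visual fidelity dominates
    early, temporal coherence peaks mid-way, semantic alignment late.
    """
    c = min(max(competence, 0.0), 1.0)  # clamp to [0, 1]
    raw = {
        "visual": 1.0 - c,
        "temporal": 1.0 - abs(2.0 * c - 1.0),
        "semantic": c,
    }
    total = sum(raw.values())  # equals 2 - |2c - 1|, so always >= 1
    return {name: value / total for name, value in raw.items()}


def combined_reward(scores, competence):
    """Weighted sum of per-aspect scores under the current weights."""
    weights = self_paced_weights(competence)
    return sum(weights[name] * scores[name] for name in weights)


# Example per-aspect scores for one generated video (made-up numbers).
scores = {"visual": 0.9, "temporal": 0.4, "semantic": 0.2}
early = combined_reward(scores, competence=0.0)  # visual only
late = combined_reward(scores, competence=1.0)   # semantic only
```

With these made-up scores, a video that looks good but follows the prompt poorly is rewarded well early on and poorly later, which is the curriculum effect the summary describes. How competence is estimated (for example, from a running average of recent rewards) is left open here because the source does not specify it.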

🔎 Similar Papers

MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

2024-06-28 | arXiv.org | Citations: 85
