🤖 AI Summary
Existing Group Relative Policy Optimization (GRPO) methods struggle to leverage reward signals effectively in image-to-video generation, limiting both visual quality and temporal consistency. This work proposes TAGRPO, a framework that introduces a trajectory-alignment mechanism in the latent space. By generating rollout videos from shared initial noise to form positive and negative sample pairs, TAGRPO uses contrastive learning to reinforce high-reward trajectories while suppressing low-reward ones. A video memory bank further improves sample diversity and training efficiency. Combining GRPO, flow matching, and contrastive learning, TAGRPO significantly outperforms DanceGRPO on image-to-video generation, producing videos with superior visual fidelity and improved temporal coherence.
📝 Abstract
Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation.
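The core idea — score rollouts sharing the same initial noise with group-relative advantages, then pull intermediate latents toward high-reward trajectories and push them away from low-reward ones — can be sketched as follows. This is an illustrative approximation only, not the paper's actual loss: the function names, the advantage normalization, and the plain squared-distance alignment term are all assumptions made for the sketch.

```python
import numpy as np

def group_relative_advantages(rewards):
    # GRPO-style advantages: normalize rewards within the rollout group
    # (reward minus group mean, divided by group standard deviation).
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def trajectory_alignment_loss(policy_latent, rollout_latents, rewards):
    """Hypothetical contrastive objective over intermediate latents.

    All rollouts are assumed to start from the same initial noise, so
    their latents are directly comparable. Positive-advantage rollouts
    attract the policy's latent; negative-advantage rollouts repel it.
    """
    adv = group_relative_advantages(rewards)
    # Squared L2 distance from the policy latent to each rollout's latent.
    dist = np.sum((rollout_latents - policy_latent) ** 2, axis=1)
    # Weighting distances by advantages pulls toward high-reward
    # trajectories (minimize distance) and pushes from low-reward ones.
    return float(np.mean(adv * dist))
```

In this toy form, a policy latent near the high-reward rollout yields a lower loss than one near the low-reward rollout, which is the alignment behavior the abstract describes; the real method would apply such a term across diffusion/flow timesteps with gradients through the policy network.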