🤖 AI Summary
Existing video generation methods (e.g., VACE, Phantom) struggle to maintain long-term identity consistency in dynamic multi-person interaction scenes. To address this, we propose Identity-GRPO—a novel framework that introduces Reinforcement Learning from Human Feedback (RLHF) to multi-person video generation for the first time. We construct a large-scale preference dataset explicitly designed for identity consistency and develop a GRPO variant tailored to this objective. Our method jointly leverages a video reward model and paired human annotations/synthetic distortion data to enable end-to-end optimization. Experiments demonstrate that Identity-GRPO achieves up to a 18.9% improvement over baselines on human-perceived identity consistency metrics. Ablation studies confirm the critical roles of both high-quality preference data and the customized GRPO architecture. This work establishes a new paradigm for modeling identity consistency in multi-person video generation.
📝 Abstract
While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.