🤖 AI Summary
Video identity customization faces two key challenges: identity feature degradation and diminished dynamic expressiveness in long videos. Both stem from the inherent limitations of conventional static-image-based self-reconstruction training, which struggles to jointly preserve identity consistency and motion richness. To address this, we propose a user-preference-driven end-to-end optimization framework. Our approach introduces, for the first time, an explicit dual-reward mechanism that jointly models identity fidelity and dynamic quality, supported by a novel pairwise preference dataset. We further design a hybrid "static-first + frontier-sampling" strategy to overcome the self-reconstruction constraints, and incorporate a static-video-guided module that preserves identity while enhancing dynamic quality. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches across multiple quantitative and qualitative metrics, achieving concurrent improvements in identity fidelity and motion coherence, particularly in long-duration videos.
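Neither reward is specified at this level of detail. The sketch below shows one plausible way such a dual-reward pair construction could work, assuming a caller-supplied face encoder (`face_embed`) as the identity signal and mean inter-frame pixel change as a crude dynamics proxy; all function names, the static-video construction, and the weighting scheme are illustrative assumptions, not the paper's actual reward models.

```python
import numpy as np

def static_video_from_reference(ref_image, num_frames):
    """'Static-first' stage (assumed construction): repeat the reference
    image to form a static clip that maximizes identity consistency."""
    return [ref_image.copy() for _ in range(num_frames)]

def identity_reward(video_frames, ref_embedding, face_embed):
    """Mean cosine similarity between each frame's face embedding and the
    reference-image embedding (higher = better identity fidelity)."""
    sims = []
    for frame in video_frames:
        emb = face_embed(frame)  # hypothetical face encoder
        sims.append(np.dot(emb, ref_embedding)
                    / (np.linalg.norm(emb) * np.linalg.norm(ref_embedding)))
    return float(np.mean(sims))

def dynamic_reward(video_frames):
    """Mean absolute inter-frame pixel change as a crude motion proxy."""
    diffs = [np.abs(b.astype(np.float32) - a.astype(np.float32)).mean()
             for a, b in zip(video_frames[:-1], video_frames[1:])]
    return float(np.mean(diffs))

def build_preference_pair(video_a, video_b, ref_embedding, face_embed,
                          w_id=1.0, w_dyn=1.0):
    """Score two candidate videos with the dual reward and order them into
    a (preferred, rejected, reward_gap) triple for preference learning."""
    def score(v):
        return (w_id * identity_reward(v, ref_embedding, face_embed)
                + w_dyn * dynamic_reward(v))
    ra, rb = score(video_a), score(video_b)
    return ((video_a, video_b, ra - rb) if ra >= rb
            else (video_b, video_a, rb - ra))
```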
📝 Abstract
Video identity customization seeks to produce high-fidelity videos that maintain consistent identity and exhibit rich dynamics based on users' reference images. However, existing approaches face two key challenges: identity degradation over extended video length and reduced dynamics during training, both stemming from their reliance on traditional self-reconstruction training with static images. To address these issues, we introduce $\textbf{MagicID}$, a novel framework designed to directly promote the generation of identity-consistent and dynamically rich videos tailored to user preferences. Specifically, rather than relying on traditional self-reconstruction, we construct pairwise preference video data with explicit identity and dynamic rewards for preference learning. To address the scarcity of customized preference data, we introduce a hybrid sampling strategy: it first prioritizes identity preservation by leveraging static videos derived from the reference images, then enhances motion quality in the generated videos using a frontier-based sampling method. Using these hybrid preference pairs, we optimize the model to align with the reward differences between pairs of customized preferences. Extensive experiments show that MagicID successfully achieves consistent identity and natural dynamics, surpassing existing methods across various metrics.
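The abstract says the model is optimized to align with the reward differences between preference pairs, which suggests an objective in the spirit of DPO (Direct Preference Optimization). Below is a minimal sketch under that assumption; the reward-gap weighting, `beta`, and the log-probability inputs are illustrative choices, not the paper's published loss.

```python
import torch
import torch.nn.functional as F

def reward_weighted_dpo_loss(logp_policy_win, logp_policy_lose,
                             logp_ref_win, logp_ref_lose,
                             reward_gap, beta=0.1):
    """DPO-style preference loss on (preferred, rejected) video pairs.

    logp_* are tensors of summed log-likelihoods of the winning/losing
    videos under the trained policy and a frozen reference model;
    reward_gap is the non-negative dual-reward difference r(win) - r(lose).
    """
    # Implicit reward margin between the preferred and rejected sample.
    margin = ((logp_policy_win - logp_ref_win)
              - (logp_policy_lose - logp_ref_lose))
    # Bradley-Terry log-sigmoid objective, scaled by the explicit reward gap.
    return -(reward_gap * F.logsigmoid(beta * margin)).mean()
```

One motivation for weighting each pair by its explicit reward gap is that pairs with a clear identity/dynamics advantage contribute more gradient than near-ties, which matches the stated goal of aligning the model with the reward differences rather than with binary preferences alone.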