🤖 AI Summary
Existing identity-preserving video generation methods typically support only a single identity reference, struggle with heterogeneous inputs, and often yield ambiguous reconstructions with poor controllability due to insufficient identity information. To address these limitations, this work proposes AnyID, a novel framework featuring a unified identity representation architecture that flexibly integrates arbitrary visual references, such as face images, portrait sketches, or video clips. Generation is guided by a primary reference anchor, while a differential prompting mechanism enables fine-grained, attribute-level control. Trained on large-scale data and further refined via reinforcement learning from human preferences, AnyID significantly outperforms existing approaches in both identity fidelity and prompt-driven controllability, enabling highly consistent video generation across diverse scenes and heterogeneous input formats.
📝 Abstract
Identity-preserving video generation offers powerful tools for creative expression, allowing users to customize videos featuring their beloved characters. However, prevailing methods are typically designed and optimized for a single identity reference. This underlying assumption restricts creative flexibility by failing to accommodate diverse real-world input formats. Relying on a single source is also ill-posed: the inherently ambiguous setting makes it difficult for the model to faithfully reproduce an identity across novel contexts. To address these issues, we present AnyID, an ultra-fidelity identity-preserving video generation framework that features two core contributions. First, we introduce a scalable omni-referenced architecture that effectively unifies heterogeneous identity inputs (e.g., faces, portraits, and videos) into a cohesive representation. Second, we propose a primary-referenced generation paradigm, which designates one reference as a canonical anchor and uses a novel differential prompt to enable precise, attribute-level controllability. We train on a large-scale, meticulously curated dataset to ensure robustness and high fidelity, and then perform a final fine-tuning stage using reinforcement learning. This stage leverages a preference dataset constructed from human evaluations, in which annotators performed pairwise comparisons of videos against two key criteria: identity fidelity and prompt controllability. Extensive evaluations validate that AnyID achieves ultra-high identity fidelity as well as superior attribute-level controllability across different task settings.