🤖 AI Summary
This work addresses the challenge of maintaining identity and appearance consistency in long-duration, multi-view character video generation. To this end, the authors propose a content-anchor-based generative framework in which a compact set of anchor frames represents a character's visual attributes. A reference frame set is constructed as a consistency prior, and a superset content anchoring mechanism, combined with weakly conditioned RoPE positional encoding, mitigates copy-paste artifacts and conflicts arising from multiple references. The proposed method significantly improves cross-view identity consistency and visual coherence, enabling the generation of high-quality character videos exceeding ten minutes in length. Experimental results demonstrate superior performance over existing approaches in both identity expressiveness and visual consistency.
📝 Abstract
Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or rely on non-character-centric information as memory, leading to suboptimal consistency. Recognizing that character video generation inherently resembles an outside-looking-in scenario, we propose representing a character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency, but reference-based video generation inherently faces the challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, which provides intra- and extra-training-clip cues to prevent duplication, and RoPE as Weak Condition, which encodes positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive video collections. Experiments show that our method generates high-quality character videos exceeding 10 minutes in length and achieves expressive identity and appearance consistency across views, surpassing existing methods.
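The abstract does not spell out how "RoPE as Weak Condition" is implemented, but the stated idea, using positional offsets so the model can tell multiple anchor frames apart without binding them to a strong temporal position, can be illustrated with a minimal sketch. The code below assumes a GPT-NeoX-style rotary embedding; the function name `rope_rotate`, the tensor shapes, and the per-anchor offset spacing of 1000 positions are all hypothetical choices for illustration, not the authors' implementation.

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embedding (RoPE) to x at the given per-token positions.

    x: (num_tokens, dim) with even dim; positions: (num_tokens,) integer positions.
    Uses the split-half (GPT-NeoX-style) pairing of channels.
    """
    dim = x.shape[-1]
    half = dim // 2
    # Per-channel rotation frequencies, decaying geometrically with channel index.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (half,)
    angles = positions[:, None].float() * freqs[None, :]                # (tokens, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Hypothetical usage: give each anchor frame a distinct coarse offset placed
# outside the denoised clip's position range [0, clip_len), so anchors remain
# distinguishable from one another while staying only weakly tied to any
# specific temporal location in the generated clip.
num_anchors, tokens_per_frame, dim = 3, 4, 64
clip_len = 16
anchor_tokens = torch.randn(num_anchors * tokens_per_frame, dim)
offsets = torch.arange(num_anchors).repeat_interleave(tokens_per_frame) * 1000 + clip_len
anchor_tokens = rope_rotate(anchor_tokens, offsets)
```

Under this reading, the offsets act as a weak condition: they carry enough positional signal for attention to separate the references, while the large, arbitrary spacing keeps the anchors from being copy-pasted into nearby frame positions.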