Gloria: Consistent Character Video Generation via Content Anchors

📅 2026-03-31
🤖 AI Summary
This work addresses the challenge of maintaining identity and appearance consistency in long-duration, multi-view character video generation. The authors propose a generative framework that represents a character's visual attributes through a compact set of anchor frames (content anchors), which serve as a stable consistency prior. Two mechanisms mitigate the copy-paste artifacts and multi-reference conflicts typical of reference-based generation: Superset Content Anchoring, which supplies cues from both inside and outside the training clip, and RoPE as Weak Condition, which encodes positional offsets to distinguish multiple anchors. The authors report high-quality character videos exceeding ten minutes in length, with identity expressiveness and cross-view appearance consistency that surpass existing approaches.
📝 Abstract
Digital characters are central to modern media, yet generating character videos with long-duration, consistent multi-view appearance and expressive identity remains challenging. Existing approaches either provide insufficient context to preserve identity or leverage non-character-centric information as the memory, leading to suboptimal consistency. Recognizing that character video generation inherently resembles an outside-looking-in scenario, we propose representing the character's visual attributes through a compact set of anchor frames. This design provides stable references for consistency, but reference-based video generation inherently faces the challenges of copy-pasting and multi-reference conflicts. To address these, we introduce two mechanisms: Superset Content Anchoring, which provides intra- and extra-training-clip cues to prevent duplication, and RoPE as Weak Condition, which encodes positional offsets to distinguish multiple anchors. Furthermore, we construct a scalable pipeline to extract these anchors from massive video collections. Experiments show that our method generates high-quality character videos exceeding 10 minutes and achieves expressive identity and appearance consistency across views, surpassing existing methods.
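The abstract describes RoPE as Weak Condition only at a high level: positional offsets are encoded so that multiple anchors can be told apart. The following is a minimal sketch of that general idea, not the authors' implementation. It applies standard 1D rotary position embedding in NumPy and assigns each anchor frame a distinct position outside the clip's own range. The function names, the `gap` parameter, and the offset scheme in `anchor_positions` are all hypothetical; the abstract does not specify how the offsets are chosen.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles of shape (len(positions), dim // 2); dim must be even."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (tokens, dim) by position-dependent angles."""
    angles = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def anchor_positions(num_clip_frames, num_anchors, gap=8):
    """Hypothetical offset scheme (an assumption, not taken from the paper):
    clip frames use positions 0..T-1, and anchor k is placed at T - 1 + gap * (k + 1),
    so every anchor has a distinct position that no clip frame shares."""
    clip = np.arange(num_clip_frames)
    anchors = num_clip_frames - 1 + gap * (np.arange(num_anchors) + 1)
    return clip, anchors

# Usage: 4 clip frames and 3 anchor frames, one 8-dimensional token per frame.
clip_pos, anchor_pos = anchor_positions(num_clip_frames=4, num_anchors=3, gap=8)
rng = np.random.default_rng(0)
clip_tokens = apply_rope(rng.normal(size=(4, 8)), clip_pos)
anchor_tokens = apply_rope(rng.normal(size=(3, 8)), anchor_pos)
print(anchor_pos)  # [11 19 27]
```

Because rotary embeddings make attention scores depend on relative position, giving each anchor its own offset is one plausible way to let a model distinguish anchors from each other and from the frames being generated, without a hard constraint on where anchor content may appear.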
Problem

Research questions and friction points this paper is trying to address.

character video generation
appearance consistency
identity preservation
multi-view consistency
long-duration video
Innovation

Methods, ideas, or system contributions that make the work stand out.

Content Anchors
Character Video Generation
Appearance Consistency
Superset Content Anchoring
RoPE as Weak Condition
Authors

Yuhang Yang
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC

Fan Zhang
UNSW

Huaijin Pi
The University of Hong Kong
Computer vision

Shuai Guo
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC

Guowei Xu
Tsinghua University
Language Models, Reinforcement Learning

Wei Zhai
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC

Yang Cao
University of Science and Technology of China
Computer vision, image processing

Zheng-Jun Zha
MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, USTC