CyberHost: Taming Audio-driven Avatar Diffusion Model with Region Codebook Attention

📅 2024-09-03
🏛️ arXiv.org
📈 Citations: 12
✨ Influential: 3
🤖 AI Summary
This work addresses key cross-modal generation challenges in audio-driven human animation: poor hand integrity, identity inconsistency, and unnatural motion. We propose the first end-to-end audio-driven zero-shot human video diffusion model. Methodologically, we introduce Region Codebook Attention to jointly encode fine-grained local region features and motion priors; further, we incorporate human-specific prior-guided training, including body motion map modeling, hand clarity scoring, pose-aligned reference features, and localized enhancement supervision. Quantitative and qualitative evaluations demonstrate that our approach consistently outperforms state-of-the-art methods, significantly improving facial and hand detail fidelity, identity preservation, and overall motion naturalness. Our framework establishes a new paradigm for cross-modal human animation generation.

๐Ÿ“ Abstract
Diffusion-based video generation technology has advanced significantly, catalyzing a proliferation of research in human animation. However, the majority of these studies are confined to same-modality driving settings, with cross-modality human body animation remaining relatively underexplored. In this paper, we introduce CyberHost, an end-to-end audio-driven human animation framework that ensures hand integrity, identity consistency, and natural motion. The key design of CyberHost is the Region Codebook Attention mechanism, which improves the generation quality of facial and hand animations by integrating fine-grained local features with learned motion pattern priors. Furthermore, we have developed a suite of human-prior-guided training strategies, including a body movement map, a hand clarity score, pose-aligned reference features, and local enhancement supervision, to improve synthesis results. To our knowledge, CyberHost is the first end-to-end audio-driven human diffusion model capable of zero-shot video generation for the human body. Extensive experiments demonstrate that CyberHost surpasses previous works in both quantitative and qualitative aspects.
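The core idea of Region Codebook Attention, as the abstract describes it, is to let features from a local region (hands, face) attend over a bank of learned latent vectors that encode motion priors. The sketch below is a minimal, framework-free illustration of that cross-attention pattern; the dimensions, the small codebook, and the use of codebook entries as both keys and values are illustrative assumptions, not the paper's exact parameterization.

```python
import math

def codebook_attention(region_feats, codebook, dim):
    """Attend each local region feature over a learned codebook.

    region_feats: list of query vectors, one per spatial location in the
        hand/face region (each a list of `dim` floats).
    codebook: list of learned latent vectors serving as both keys and
        values (a simplification for illustration).
    Returns one attended vector per query: a softmax-weighted convex
    combination of codebook entries.
    """
    scale = 1.0 / math.sqrt(dim)  # standard dot-product attention scaling
    out = []
    for q in region_feats:
        # Scaled dot-product scores against every codebook entry.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in codebook]
        # Numerically stable softmax over the codebook dimension.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of codebook vectors = retrieved motion prior.
        out.append([sum(w * k[j] for w, k in zip(weights, codebook))
                    for j in range(dim)])
    return out
```

In the full model these attended features would be injected back into the diffusion U-Net for the corresponding region; here the function only shows the retrieval step, which is what lets a small set of learned priors regularize fine-grained hand and face generation.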
Problem

Research questions and friction points this paper is trying to address.

Cross-modality audio-driven human animation
Ensuring hand integrity and identity consistency
Improving facial and hand animation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Region Codebook Attention for fine-grained animation
Human-prior-guided training strategies for synthesis
End-to-end audio-driven zero-shot video generation
Authors
Gaojie Lin (ByteDance)
Jianwen Jiang (ByteDance)
Chao Liang (ByteDance)
Tianyun Zhong (Zhejiang University)
Jiaqi Yang (ByteDance)
Yanbo Zheng (ByteDance)