Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

📅 2024-09-04
🏛️ arXiv.org
📈 Citations: 22
Influential: 6
📄 PDF
🤖 AI Summary
Existing audio-driven talking-head video generation methods suffer from weak audio-based motion modeling and often fall back on hand-crafted spatial templates, resulting in rigid motion and limited naturalness and expressiveness. This paper introduces Loopy, an end-to-end video diffusion model conditioned on audio alone, requiring no spatial priors or motion templates. Two designs drive the approach: (1) an inter- and intra-clip temporal module that jointly models dynamics within a clip and across preceding clips, explicitly capturing long-range motion dependencies; and (2) an audio-to-latents module that strengthens the correlation between audio and portrait motion. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models on lip-sync accuracy, motion naturalness, visual detail fidelity, and temporal consistency.
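
To make the temporal design concrete, the PyTorch sketch below applies self-attention within the current clip's latents and then attends over latents carried over from preceding clips as additional context. Loopy's code is not public, so the class name, tensor shapes, default dimensions, and the two-pass layout are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class InterIntraClipTemporal(nn.Module):
    """Hypothetical temporal block: intra-clip self-attention followed by
    inter-clip attention over latents carried over from earlier clips."""

    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.norm_intra = nn.LayerNorm(dim)
        self.norm_inter = nn.LayerNorm(dim)
        self.intra_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
        # cur:  (B, T_cur, dim)  latents of the clip being denoised
        # past: (B, T_past, dim) latents from earlier clips (long-term motion)
        h = self.norm_intra(cur)
        cur = cur + self.intra_attn(h, h, h)[0]        # intra-clip dynamics
        ctx = torch.cat([past, cur], dim=1)            # prepend motion history
        q, kv = self.norm_inter(cur), self.norm_inter(ctx)
        cur = cur + self.inter_attn(q, kv, kv)[0]      # inter-clip dependencies
        return cur

block = InterIntraClipTemporal()
out = block(torch.randn(2, 16, 320), torch.randn(2, 48, 320))  # -> (2, 16, 320)

Widening the attention context with past-clip latents is what lets the model learn motion patterns that span many clips, which is the paper's stated alternative to hand-crafted spatial templates.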

📝 Abstract
With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Due to the limited control of audio signals in driving human motion, existing methods often add auxiliary spatial signals to stabilize movements, which may compromise the naturalness and freedom of motion. In this paper, we propose an end-to-end audio-only conditioned video diffusion model named Loopy. Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information from the data to learn natural motion patterns and improving audio-portrait movement correlation. This method removes the need for manually specified spatial motion templates used in existing methods to constrain motion during inference. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios.
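
The audio-to-latents module can likewise be sketched as a small cross-attention pooler: a handful of learned query tokens summarize a projected audio feature sequence into conditioning latents for the diffusion backbone. The feature source (a wav2vec-style speech encoder), the dimensions, and the pooling design here are assumptions for illustration, not the published module.

import torch
import torch.nn as nn

class AudioToLatents(nn.Module):
    """Hypothetical audio-to-latents mapping: learned queries cross-attend to
    audio features and emit a compact set of conditioning tokens."""

    def __init__(self, audio_dim: int = 768, latent_dim: int = 320,
                 num_latents: int = 8, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_latents, latent_dim) * 0.02)
        self.proj = nn.Linear(audio_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T_audio, audio_dim), e.g. frames from a speech encoder
        kv = self.proj(audio_feats)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)   # (B, num_latents, latent_dim)
        return tokens                       # injected into the denoiser as conditions

a2l = AudioToLatents()
cond = a2l(torch.randn(2, 100, 768))  # -> (2, 8, 320)

Pooling audio into a few latent tokens, rather than into per-pixel spatial maps, matches the abstract's goal of tightening audio-portrait correlation without spatially constraining the motion.
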
Problem

Research questions and friction points this paper is trying to address.

Enhancing the naturalness of audio-driven human video generation
Eliminating the need for spatial motion templates
Improving audio-portrait movement correlation
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end audio-only video diffusion model
Inter- and intra-clip temporal module
Audio-to-latents module for motion correlation
👥 Authors
Jianwen Jiang
ByteDance
Chao Liang
ByteDance
Jiaqi Yang
ByteDance
Gaojie Lin
ByteDance
Tianyun Zhong
Zhejiang University
Yanbo Zheng
ByteDance