Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

📅 2024-09-04
🏛️ arXiv.org
📈 Citations: 22
Influential: 6
📄 PDF
🤖 AI Summary
Existing audio-driven talking-head video generation methods suffer from weak audio-based motion modeling and often fall back on hand-crafted spatial templates, resulting in rigid motion and limited naturalness and expressiveness. This paper introduces Loopy, an end-to-end video diffusion model conditioned on audio alone, requiring no spatial priors or motion templates. Two designs drive the approach: (1) an inter- and intra-clip temporal module that jointly models dynamics within a clip and across preceding clips, explicitly capturing long-range motion dependencies; and (2) an audio-to-latents module that strengthens the correlation between audio and portrait motion. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models on lip-sync accuracy, motion naturalness, visual detail fidelity, and temporal consistency.
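
To make the temporal design concrete, the PyTorch sketch below applies self-attention within the current clip's latents and then attends over latents carried over from preceding clips as additional context. Loopy's code is not public, so the class name, tensor shapes, default dimensions, and the two-pass layout are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class InterIntraClipTemporal(nn.Module):
    """Hypothetical temporal block: intra-clip self-attention followed by
    inter-clip attention over latents carried over from earlier clips."""

    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.norm_intra = nn.LayerNorm(dim)
        self.norm_inter = nn.LayerNorm(dim)
        self.intra_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cur: torch.Tensor, past: torch.Tensor) -> torch.Tensor:
        # cur:  (B, T_cur, dim)  latents of the clip being denoised
        # past: (B, T_past, dim) latents from earlier clips (long-term motion)
        h = self.norm_intra(cur)
        cur = cur + self.intra_attn(h, h, h)[0]        # intra-clip dynamics
        ctx = torch.cat([past, cur], dim=1)            # prepend motion history
        q, kv = self.norm_inter(cur), self.norm_inter(ctx)
        cur = cur + self.inter_attn(q, kv, kv)[0]      # inter-clip dependencies
        return cur

block = InterIntraClipTemporal()
out = block(torch.randn(2, 16, 320), torch.randn(2, 48, 320))  # -> (2, 16, 320)

Widening the attention context with past-clip latents is what lets the model learn motion patterns that span many clips, which is the paper's stated alternative to hand-crafted spatial templates.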

📝 Abstract
With the introduction of diffusion-based video generation techniques, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Due to the limited control of audio signals in driving human motion, existing methods often add auxiliary spatial signals to stabilize movements, which may compromise the naturalness and freedom of motion. In this paper, we propose an end-to-end audio-only conditioned video diffusion model named Loopy. Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information from the data to learn natural motion patterns and improving audio-portrait movement correlation. This method removes the need for manually specified spatial motion templates used in existing methods to constrain motion during inference. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios.
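
The audio-to-latents module can likewise be sketched as a small cross-attention pooler: a handful of learned query tokens summarize a projected audio feature sequence into conditioning latents for the diffusion backbone. The feature source (a wav2vec-style speech encoder), the dimensions, and the pooling design here are assumptions for illustration, not the published module.

import torch
import torch.nn as nn

class AudioToLatents(nn.Module):
    """Hypothetical audio-to-latents mapping: learned queries cross-attend to
    audio features and emit a compact set of conditioning tokens."""

    def __init__(self, audio_dim: int = 768, latent_dim: int = 320,
                 num_latents: int = 8, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_latents, latent_dim) * 0.02)
        self.proj = nn.Linear(audio_dim, latent_dim)
        self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (B, T_audio, audio_dim), e.g. frames from a speech encoder
        kv = self.proj(audio_feats)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)   # (B, num_latents, latent_dim)
        return tokens                       # injected into the denoiser as conditions

a2l = AudioToLatents()
cond = a2l(torch.randn(2, 100, 768))  # -> (2, 8, 320)

Pooling audio into a few latent tokens, rather than into per-pixel spatial maps, matches the abstract's goal of tightening audio-portrait correlation without spatially constraining the motion.
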
Problem

Research questions and friction points this paper is trying to address.

Enhancing the naturalness of audio-driven human video generation
Eliminating the need for spatial motion templates
Improving audio-portrait movement correlation
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end audio-only video diffusion model
Inter- and intra-clip temporal module
Audio-to-latents module for motion correlation
👥 Authors
Jianwen Jiang
ByteDance
Chao Liang
ByteDance
Jiaqi Yang
ByteDance
Gaojie Lin
ByteDance
Tianyun Zhong
Zhejiang University
Yanbo Zheng
ByteDance