OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation

πŸ“… 2025-06-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing audio-driven human animation methods primarily focus on facial synthesis, struggling to achieve natural full-body motion, precise temporal synchronization, and fine-grained textual control. To address these limitations, we propose a novel framework for generating full-body virtual avatar videos. Our approach introduces a pixel-level multi-scale audio embedding mechanism to enhance spatiotemporal modeling of audio features within the diffusion model’s latent space. Coupled with LoRA-based fine-tuning, it enables high-fidelity lip synchronization and fluid whole-body motion while preserving strong text controllability. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches on facial and upper-body video generation: lip-sync error decreases by 21.3%, and motion fluency (measured by FID) improves by 34.7%. The framework supports diverse high-quality applications, including podcast avatars, human-computer interaction, and singing-and-dancing synthesis.

Technology Category

Application Category

πŸ“ Abstract
Significant progress has been made in audio-driven human animation, while most existing methods focus mainly on facial movements, limiting their ability to create full-body animations with natural synchronization and fluidity. They also struggle with precise prompt control for fine-grained generation. To tackle these challenges, we introduce OmniAvatar, an innovative audio-driven full-body video generation model that enhances human animation with improved lip-sync accuracy and natural movements. OmniAvatar introduces a pixel-wise multi-hierarchical audio embedding strategy to better capture audio features in the latent space, enhancing lip-syncing across diverse scenes. To preserve the capability for prompt-driven control of foundation models while effectively incorporating audio features, we employ a LoRA-based training approach. Extensive experiments show that OmniAvatar surpasses existing models in both facial and semi-body video generation, offering precise text-based control for creating videos in various domains, such as podcasts, human interactions, dynamic scenes, and singing. Our project page is https://omni-avatar.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Generating full-body animations with natural audio synchronization
Enhancing lip-sync accuracy and movement fluidity in animations
Achieving precise text-based control for fine-grained video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive full-body animation with natural synchronization
Pixel-wise multi-hierarchical audio embedding strategy
LoRA-based training for prompt-driven control
πŸ”Ž Similar Papers
No similar papers found.