🤖 AI Summary
Existing audio-driven upper-body animation methods suffer from slow multi-stage inference, poor local generation quality (e.g., lip motion and gestures), and audio-motion asynchrony, largely due to the lack of fine-grained local supervision. To address these issues, we propose PAHA, an end-to-end diffusion-based framework with two novel components: (1) a Parts-Aware Re-weighting (PAR) loss, which adjusts regional training loss weights based on pose confidence, and (2) a Parts Consistency Enhancement (PCE) module, which trains diffusion-based regional audio-visual classifiers to model cross-part motion consistency. For these classifiers we further design two inference guidance strategies, Sequential Guidance and Differential Guidance, which trade off efficiency and quality. Additionally, we introduce CNAS, the first publicly available Chinese news anchor speech-motion dataset. Experiments demonstrate significant improvements over state-of-the-art methods in audio-motion alignment (LPIPS ↓12.3%, SyncNet score ↑18.7%) and visual quality, and user studies confirm enhanced naturalness and temporal consistency. Code and the CNAS dataset will be publicly released.
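The Parts-Aware Re-weighting idea, dynamically re-weighting the regional training loss by pose confidence, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the mask/confidence layout, the confidence-to-weight mapping, and all function names are assumptions.

```python
import numpy as np

def parts_aware_loss(pred_noise, true_noise, region_masks, confidences,
                     base_weight=1.0):
    """Hedged sketch of a parts-aware re-weighted diffusion loss.

    pred_noise, true_noise: (B, C, H, W) predicted / target diffusion noise.
    region_masks: (B, R, H, W) binary masks for R body parts (face, hands, ...).
    confidences: (B, R) pose-detector confidence per part; low-confidence
    (harder) regions are up-weighted so training focuses on them.
    """
    # Per-pixel squared error, averaged over channels.
    err = ((pred_noise - true_noise) ** 2).mean(axis=1)            # (B, H, W)
    # Map confidence to weights: weight grows as confidence drops.
    weights = base_weight + (1.0 - confidences)                    # (B, R)
    # Broadcast each part's weight onto the pixels its mask covers.
    weight_map = (region_masks * weights[:, :, None, None]).sum(axis=1)
    # Pixels outside every part keep the base weight.
    weight_map = weight_map + (region_masks.sum(axis=1) == 0) * base_weight
    return float((weight_map * err).mean())
```

With this scheme, dropping a part's confidence strictly increases its contribution to the loss, which is the intended "focus training on unreliable regions" behavior.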
📝 Abstract
Audio-driven human animation is widely used in human-computer interaction, and the emergence of diffusion models has further advanced its development. Currently, most methods rely on multi-stage generation and intermediate representations, resulting in long inference times, degraded generation quality in specific foreground regions, and weak audio-motion consistency. These shortcomings stem primarily from the lack of localized, fine-grained supervision. To address these challenges, we propose PAHA, an end-to-end audio-driven upper-body human animation framework built on a diffusion model. We introduce two key methods: Parts-Aware Re-weighting (PAR) and Parts Consistency Enhancement (PCE). PAR dynamically adjusts regional training loss weights based on pose confidence scores, effectively improving visual quality. PCE constructs and trains diffusion-based regional audio-visual classifiers to improve the consistency of motion with co-speech audio. We then design two novel inference guidance methods for these classifiers, Sequential Guidance (SG) and Differential Guidance (DG), to balance efficiency and quality, respectively. Additionally, we build CNAS, the first public Chinese News Anchor Speech dataset, to advance research and validation in this field. Extensive experimental results and user studies demonstrate that PAHA significantly outperforms existing methods in audio-motion alignment and video quality evaluations. The code and CNAS dataset will be released upon acceptance.
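The classifier-based inference guidance that SG and DG build on follows the standard diffusion classifier-guidance recipe: shift the reverse-process mean by the gradient of a classifier's log-likelihood. A toy sketch under stated assumptions — the quadratic stand-in classifier and all names here are illustrative, not the paper's audio-visual classifiers or its SG/DG schedules:

```python
import numpy as np

def classifier_grad(x, target, sharpness=1.0):
    # Toy stand-in classifier: log p(y | x) = -(sharpness / 2) * ||x - target||^2,
    # so grad_x log p(y | x) = sharpness * (target - x). A real audio-visual
    # classifier would supply this gradient via backpropagation instead.
    return sharpness * (target - x)

def guided_reverse_step(mu, sigma2, target, scale=1.0, rng=None):
    """One classifier-guided reverse diffusion step:
    mu' = mu + scale * sigma^2 * grad_x log p(y | x), then sample around mu'."""
    rng = np.random.default_rng() if rng is None else rng
    mu_guided = mu + scale * sigma2 * classifier_grad(mu, target)
    return mu_guided + np.sqrt(sigma2) * rng.standard_normal(np.shape(mu))
```

Each guided step nudges the sampling mean toward states the classifier scores highly; applying such classifiers per body part is one way the described audio-motion consistency guidance could operate.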