InfinityHuman: Towards Long-Term Audio-Driven Human Animation

📅 2025-08-27
🤖 AI Summary
Audio-driven human animation struggles to produce high-resolution, long-duration videos: existing approaches exhibit visual inconsistency, hand motion distortion, and poor lip synchronization. To address these issues, the paper proposes a coarse-to-fine framework that first learns audio-synchronized implicit representations and then refines them with a pose-guided network that decouples pose from appearance modeling to suppress temporal drift. A visual anchoring mechanism further enforces inter-frame consistency, and a hand-specific reward function, trained on high-quality hand motion data and combined with reinforcement learning, jointly optimizes gesture semantic fidelity and lip-sync accuracy. On the EMTD and HDTF benchmarks, the method significantly mitigates identity drift and color shift, achieving state-of-the-art video quality, hand naturalness, and lip-sync precision.

📝 Abstract
Audio-driven human animation has attracted wide attention thanks to its practical applications. However, critical challenges remain in generating high-resolution, long-duration videos with consistent appearance and natural hand motions. Existing methods extend videos using overlapping motion frames but suffer from error accumulation, leading to identity drift, color shifts, and scene instability. Additionally, hand movements are poorly modeled, resulting in noticeable distortions and misalignment with the audio. In this work, we propose InfinityHuman, a coarse-to-fine framework that first generates audio-synchronized representations, then progressively refines them into high-resolution, long-duration videos using a pose-guided refiner. Since pose sequences are decoupled from appearance and resist temporal degradation, our pose-guided refiner employs stable poses and the initial frame as a visual anchor to reduce drift and improve lip synchronization. Moreover, to enhance semantic accuracy and gesture realism, we introduce a hand-specific reward mechanism trained with high-quality hand motion data. Experiments on the EMTD and HDTF datasets show that InfinityHuman achieves state-of-the-art performance in video quality, identity preservation, hand accuracy, and lip-sync. Ablation studies further confirm the effectiveness of each module. Code will be made public.
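The coarse-to-fine pipeline described in the abstract can be sketched as follows. This is a minimal structural illustration, not the paper's actual code: all function names, data shapes, and the use of plain dictionaries in place of real tensors are assumptions made for clarity.

```python
# Hypothetical sketch of the InfinityHuman coarse-to-fine pipeline structure.
# Names and data shapes are illustrative assumptions, not the paper's API.

def coarse_stage(audio_chunks, ref_image):
    """Stage 1: generate low-resolution, audio-synchronized representations,
    including a pose sequence for each audio chunk."""
    return [{"audio": a, "pose": f"pose_{i}", "latent": f"latent_{i}"}
            for i, a in enumerate(audio_chunks)]

def pose_guided_refiner(coarse, visual_anchor):
    """Stage 2: refine each chunk to high resolution. Conditioning on the
    appearance-free pose sequence plus the initial frame as a visual anchor
    is what the paper credits with suppressing identity/color drift."""
    frames = []
    for chunk in coarse:
        frames.append({
            "pose": chunk["pose"],    # pose is decoupled from appearance,
            "anchor": visual_anchor,  # so it resists temporal degradation
            "hires": True,
        })
    return frames

def generate(audio_chunks, ref_image):
    coarse = coarse_stage(audio_chunks, ref_image)
    return pose_guided_refiner(coarse, visual_anchor=ref_image)

video = generate(["chunk_a", "chunk_b", "chunk_c"], ref_image="first_frame")
```

Note how every refined chunk is conditioned on the same anchor frame rather than on the previous chunk's output, which is the key difference from overlapping-motion-frame extension and the reason error does not accumulate across chunks.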
Problem

Research questions and friction points this paper is trying to address.

Generating high-resolution long-duration human animation videos
Preventing identity drift and appearance degradation over time
Improving hand motion realism and audio synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coarse-to-fine framework with pose-guided refinement
Decoupled pose sequences resist temporal degradation
Hand-specific reward mechanism enhances gesture realism
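As a toy illustration of the hand-specific reward idea above, the sketch below scores a predicted hand pose against a reference so that closer poses earn higher reward, which an RL fine-tuning loop could then maximize. This is an assumed stand-in: the paper's actual reward is a learned model trained on high-quality hand motion data, not a hand-written error metric.

```python
# Illustrative hand-pose reward (not the paper's learned reward model).
# Poses are represented here as flat lists of joint coordinates.

def hand_reward(pred_hand_pose, ref_hand_pose):
    """Toy reward: negative mean absolute joint error, so a perfect
    hand pose scores 0.0 and worse poses score increasingly negative."""
    err = sum(abs(p - r) for p, r in zip(pred_hand_pose, ref_hand_pose))
    return -err / len(ref_hand_pose)
```

In a reinforcement-learning setup, such a reward would be computed on generated frames and used to weight generator updates, steering the model toward semantically faithful, undistorted gestures.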