StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-driven talking-head video generation methods suffer from three key limitations: difficulty in synthesizing long-duration videos, poor audio-visual synchronization, and weak identity consistency. This paper proposes an end-to-end video diffusion Transformer framework that takes a reference image and raw speech audio as input to generate arbitrarily long, high-fidelity videos. Our core contributions are: (1) a timestep-aware audio adapter that mitigates error accumulation in the latent space; (2) an audio-native guidance mechanism that enforces fine-grained frame-level speech-motion alignment; and (3) a dynamic weighted sliding-window strategy that ensures long-term temporal coherence. By synergistically integrating timestep modulation, enhanced cross-attention, and dynamic latent fusion, our method achieves significant improvements over state-of-the-art approaches across multiple benchmarks. Quantitative and qualitative evaluations demonstrate superior synchronization accuracy and robust identity preservation throughout extended sequences.
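The audio-native guidance described in the summary can be pictured as a classifier-free-guidance-style combination of denoising predictions with and without the audio condition. The sketch below is a simplification under that assumption; the paper's actual mechanism uses the model's own evolving joint audio-latent prediction as the guidance signal, which is not reproduced here, and `scale` is a hypothetical guidance weight:

```python
import numpy as np

def audio_guided_prediction(eps_cond, eps_uncond, scale=4.5):
    """Combine audio-conditioned and unconditioned denoising predictions,
    pushing the result toward the audio-conditioned branch.
    `scale` is a hypothetical guidance weight, not a value from the paper."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With `scale=1.0` the combination reduces to the audio-conditioned prediction; larger values amplify the audio-driven component of the update.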

📝 Abstract
Current diffusion models for audio-driven avatar video generation struggle to synthesize long videos with natural audio synchronization and identity consistency. This paper presents StableAvatar, the first end-to-end video diffusion transformer that synthesizes infinite-length high-quality videos without post-processing. Conditioned on a reference image and audio, StableAvatar integrates tailored training and inference modules to enable infinite-length video generation. We observe that the main obstacle preventing existing models from generating long videos lies in their audio modeling. They typically rely on third-party off-the-shelf extractors to obtain audio embeddings, which are then injected directly into the diffusion model via cross-attention. Since current diffusion backbones lack any audio-related priors, this approach causes severe latent distribution error accumulation across video clips, causing the latent distribution of subsequent segments to gradually drift away from the optimal distribution. To address this, StableAvatar introduces a novel Time-step-aware Audio Adapter that prevents error accumulation via time-step-aware modulation. During inference, we propose a novel Audio Native Guidance Mechanism that further enhances audio synchronization by leveraging the diffusion model's own evolving joint audio-latent prediction as a dynamic guidance signal. To enhance the smoothness of infinite-length videos, we introduce a Dynamic Weighted Sliding-window Strategy that fuses latents over time. Experiments on benchmarks demonstrate the effectiveness of StableAvatar both qualitatively and quantitatively.
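The sliding-window fusion in the abstract can be illustrated with a minimal sketch: consecutive latent windows that share a fixed number of overlapping frames are blended with ramped weights at each seam. The linear cross-fade weights and array shapes here are assumptions for illustration; the paper's actual dynamic weighting scheme is not reproduced:

```python
import numpy as np

def fuse_windows(windows, overlap):
    """Blend consecutive latent windows that share `overlap` frames at
    each seam, cross-fading with linearly ramped weights (a hypothetical
    stand-in for the paper's dynamic weighting).

    windows: list of arrays shaped (frames, ...), frames on axis 0.
    """
    fused = windows[0]
    for nxt in windows[1:]:
        # Weight ramps from 0 (keep previous window) to 1 (keep next window).
        w = np.linspace(0.0, 1.0, overlap).reshape(-1, *([1] * (nxt.ndim - 1)))
        seam = (1.0 - w) * fused[-overlap:] + w * nxt[:overlap]
        fused = np.concatenate([fused[:-overlap], seam, nxt[overlap:]], axis=0)
    return fused
```

Fusing two 4-frame windows with a 2-frame overlap yields a 6-frame sequence; in a full pipeline this blending would run in latent space before decoding, so seams never appear in pixel space.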
Problem

Research questions and friction points this paper is trying to address.

Generating infinite-length avatar videos with accurate audio synchronization
Preventing latent distribution error accumulation in long videos
Maintaining smoothness across segments of long videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end video diffusion transformer for infinite-length generation
Time-step-aware Audio Adapter prevents error accumulation
Dynamic Weighted Sliding-window Strategy enhances video smoothness
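The Time-step-aware Audio Adapter listed above could plausibly be realized as a FiLM-style modulation: scale and shift the off-the-shelf audio embedding as a function of the diffusion timestep before cross-attention injection. The sketch below is a hypothetical interpretation, not the paper's architecture; `w_scale` and `w_shift` are assumed learned projection matrices:

```python
import numpy as np

def timestep_embedding(t, dim=8):
    """Standard sinusoidal embedding of the diffusion timestep."""
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def modulate_audio(audio_emb, t, w_scale, w_shift, dim=8):
    """FiLM-style timestep-aware modulation of an audio embedding:
    scale and shift the embedding conditioned on timestep t before it
    is injected via cross-attention. Hypothetical stand-in for the
    paper's Time-step-aware Audio Adapter."""
    temb = timestep_embedding(t, dim)
    scale = temb @ w_scale   # (dim,) @ (dim, d_audio) -> (d_audio,)
    shift = temb @ w_shift
    return audio_emb * (1.0 + scale) + shift
```

The intuition, per the abstract, is that making the audio condition a function of the denoising timestep gives the backbone an audio-aware prior at every step, rather than injecting a fixed embedding that the backbone cannot calibrate against its noise level.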