🤖 AI Summary
Existing diffusion-based audio-driven talking-head generation methods—particularly DiT variants—are constrained by high computational overhead and limited video duration. Autoregressive approaches extend generation length but suffer from error accumulation and progressive quality degradation. This paper proposes a causal diffusion architecture (1.3B parameters) integrating Progressive Step Bootstrapping (PSB), Motion Condition Injection (MCI), and Unbounded RoPE via Cache-Resetting (URCR), which respectively improve early-frame stability, temporal coherence, and maximum modeling length. Coupled with noise-frame conditional modeling and GPU-optimized inference, the method achieves real-time generation (16 FPS on a single GPU) with theoretically unbounded duration. Quantitative and qualitative evaluations demonstrate state-of-the-art performance in visual fidelity, motion smoothness, and lip-sync accuracy.
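The URCR idea above — keeping positional encodings bounded during unbounded generation by resetting them together with the KV cache — can be sketched as follows. This is a minimal illustration under assumed parameters (`cache_window` and the modulo reset are hypothetical; the paper's exact mechanism is not specified here):

```python
def rope_position(frame_idx: int, cache_window: int = 64) -> int:
    """Map a global frame index to a bounded RoPE position.

    Illustrative URCR-style sketch: whenever the KV cache is cleared
    (every `cache_window` frames in this toy version), the positional
    counter restarts, so positions never grow without bound even for
    infinite-length generation.
    """
    return frame_idx % cache_window


# Frames 0..63 use positions 0..63; frame 64 resets to position 0.
```

Because positions are reset rather than extrapolated, the model never sees RoPE indices beyond those encountered during training, which is the usual motivation for such cache-reset schemes.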
📝 Abstract
Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by high computational overhead and an inability to synthesize long-duration videos. Autoregressive approaches address this limitation through block-wise autoregressive diffusion, but they suffer from error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation, with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), which enhances temporal coherence by injecting noise-corrupted previous frames as a motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), which enables infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model runs at 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.
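The PSB contribution — spending more denoising steps on the earliest frames and fewer on later blocks — can be sketched as a simple step schedule. The function name, decay rule, and all parameter values below are hypothetical illustrations, not the paper's actual schedule:

```python
def psb_step_schedule(num_blocks: int,
                      first_steps: int = 16,
                      min_steps: int = 4,
                      decay: int = 2) -> list[int]:
    """Toy PSB-style schedule: the first block gets the most denoising
    steps, and each subsequent block gets `decay` fewer, floored at
    `min_steps`. Early frames are thus denoised more thoroughly, which
    is the stated motivation for stabilizing autoregressive rollout."""
    steps = []
    s = first_steps
    for _ in range(num_blocks):
        steps.append(max(s, min_steps))
        s -= decay
    return steps


# e.g. psb_step_schedule(8) -> [16, 14, 12, 10, 8, 6, 4, 4]
```

A linear decay is only one possible choice; any monotonically non-increasing allocation captures the same intent of bootstrapping quality from well-denoised initial frames.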