JoyAvatar: Real-time and Infinite Audio-Driven Avatar Generation with Autoregressive Diffusion

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion-based audio-driven talking-head generation methods, particularly DiT variants, are constrained by high computational overhead and limited video duration. While autoregressive approaches extend generation length, they suffer from error accumulation and progressive quality degradation. This paper proposes a 1.3B-parameter causal diffusion architecture integrating Progressive Step Bootstrapping (PSB), Motion Condition Injection (MCI), and Unbounded RoPE via Cache-Resetting (URCR), which respectively enhance first-frame stability, temporal coherence, and maximum modeling length. Coupled with noise-frame conditional modeling and GPU-optimized inference, the method achieves real-time generation (16 FPS on a single GPU) with theoretically unbounded duration. Quantitative and qualitative evaluations demonstrate state-of-the-art performance in visual fidelity, motion smoothness, and lip-sync accuracy.
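The Motion Condition Injection idea mentioned in the summary can be sketched roughly. The paper states only that noise-corrupted previous frames are injected as a motion condition; the function name, interface, and Gaussian corruption below are illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical sketch of Motion Condition Injection (MCI). The summary only
# says noise-corrupted previous frames are injected as a motion condition;
# the Gaussian corruption and this interface are assumptions.
def corrupt_motion_condition(prev_frame, noise_scale=0.1, seed=None):
    """Add Gaussian noise to the previous frame (a flat list of latent
    values) before it is injected as a motion condition.

    Mild corruption narrows the train/inference gap: at inference time the
    previous frame is itself generated and therefore imperfect.
    """
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_scale) for v in prev_frame]
```

Training on deliberately corrupted conditions of this kind is a common trick for making a model robust to its own generation errors at inference time.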

📝 Abstract
Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by high computational overhead and the inability to synthesize long-duration videos. Autoregressive methods address this by generating video with block-wise autoregressive diffusion, but they suffer from error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation, with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), which enhances temporal coherence by injecting noise-corrupted previous frames as a motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), which enables infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model runs at 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.
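The Progressive Step Bootstrapping contribution can be illustrated with a minimal step-budget schedule. The abstract states only that more denoising steps are allocated to initial frames; the linear warmup shape and every parameter value below are assumptions for illustration.

```python
# Hypothetical sketch of Progressive Step Bootstrapping (PSB). The abstract
# says only that initial frames get more denoising steps; the linear warmup
# schedule and all default values here are assumptions.
def psb_step_schedule(num_blocks, first_block_steps=50, steady_steps=8,
                      warmup_blocks=4):
    """Return the number of denoising steps for each autoregressive block.

    Early blocks get a larger step budget to stabilize generation and limit
    error accumulation; later blocks use a small fixed budget so inference
    stays real-time.
    """
    steps = []
    for b in range(num_blocks):
        if b < warmup_blocks:
            # Linearly interpolate from first_block_steps down to steady_steps.
            frac = b / max(warmup_blocks - 1, 1)
            steps.append(round(first_block_steps
                               + frac * (steady_steps - first_block_steps)))
        else:
            steps.append(steady_steps)
    return steps
```

For example, `psb_step_schedule(6)` yields a non-increasing budget that starts at 50 steps and settles at 8, reflecting the trade-off between first-frame stability and per-block latency.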
Problem

Research questions and friction points this paper is trying to address.

Real-time audio-driven avatar generation
Infinite-length video synthesis
Reducing error accumulation in autoregressive diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Step Bootstrapping stabilizes generation
Motion Condition Injection enhances temporal coherence
Unbounded RoPE enables infinite-length generation
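The URCR bullet above can be sketched as a minimal position-update rule. The paper states only that infinite-length generation comes from dynamic positional encoding with cache resets; the specific reset threshold, the idea of restarting at position 0, and the names below are assumptions.

```python
# Hypothetical sketch of Unbounded RoPE via Cache-Resetting (URCR). The
# abstract says only that cache resets and dynamic positional encoding
# enable unbounded length; this reset policy is an assumption.
def next_position(pos, block_len, max_trained_pos):
    """Advance the RoPE start position for the next autoregressive block.

    If the next block would exceed the positional range seen in training,
    restart positions from 0 (in a full system the KV cache would be reset
    at the same point, keeping only the conditioning frames), so positions
    never grow without bound.
    """
    reset = pos + block_len > max_trained_pos
    return (0 if reset else pos + block_len), reset
```

The point of such a rule is that the model never attends over RoPE indices it was not trained on, no matter how long the generated video becomes.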
Chaochao Li
JD Explore Academy
Ruikui Wang
JD Explore Academy
Liangbo Zhou
JD Explore Academy
Jinheng Feng
JD Explore Academy
Huaishao Luo
JD Explore Academy
Huan Zhang
JD Explore Academy
Youzheng Wu
JD AI Research, JD.COM
Xiaodong He
JD Explore Academy