🤖 AI Summary
Existing diffusion-based audio-driven talking-head generation methods—particularly DiT variants—are constrained by high computational overhead and limited video duration. Autoregressive approaches extend generation length but suffer from error accumulation and progressive quality degradation. This paper proposes a causal diffusion architecture (1.3B parameters) integrating Progressive Step Bootstrapping (PSB), Motion Condition Injection (MCI), and Unbounded RoPE via Cache-Resetting (URCR), which respectively improve early-frame stability, temporal coherence, and maximum modeling length. Coupled with noise-frame conditional modeling and GPU-optimized inference, the method achieves real-time generation (16 FPS on a single GPU) with theoretically unbounded duration. Quantitative and qualitative evaluations demonstrate state-of-the-art performance in visual fidelity, motion smoothness, and lip-sync accuracy.
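The URCR idea above — keeping positional encodings bounded during unbounded generation by resetting them together with the KV cache — can be sketched as follows. This is a minimal illustration under assumed parameters (`cache_window` and the modulo reset are hypothetical; the paper's exact mechanism is not specified here):

```python
def rope_position(frame_idx: int, cache_window: int = 64) -> int:
    """Map a global frame index to a bounded RoPE position.

    Illustrative URCR-style sketch: whenever the KV cache is cleared
    (every `cache_window` frames in this toy version), the positional
    counter restarts, so positions never grow without bound even for
    infinite-length generation.
    """
    return frame_idx % cache_window


# Frames 0..63 use positions 0..63; frame 64 resets to position 0.
```

Because positions are reset rather than extrapolated, the model never sees RoPE indices beyond those encountered during training, which is the usual motivation for such cache-reset schemes.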
📝 Abstract
Existing DiT-based audio-driven avatar generation methods have achieved considerable progress, yet their broader application is constrained by high computational overhead and an inability to synthesize long-duration videos. Autoregressive approaches address this limitation through block-wise autoregressive diffusion, but they suffer from error accumulation and quality degradation. To address this, we propose JoyAvatar, an audio-driven autoregressive model capable of real-time inference and infinite-length video generation, with the following contributions: (1) Progressive Step Bootstrapping (PSB), which allocates more denoising steps to initial frames to stabilize generation and reduce error accumulation; (2) Motion Condition Injection (MCI), which enhances temporal coherence by injecting noise-corrupted previous frames as a motion condition; and (3) Unbounded RoPE via Cache-Resetting (URCR), which enables infinite-length generation through dynamic positional encoding. Our 1.3B-parameter causal model runs at 16 FPS on a single GPU and achieves competitive results in visual quality, temporal consistency, and lip synchronization.
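The PSB contribution — spending more denoising steps on the earliest frames and fewer on later blocks — can be sketched as a simple step schedule. The function name, decay rule, and all parameter values below are hypothetical illustrations, not the paper's actual schedule:

```python
def psb_step_schedule(num_blocks: int,
                      first_steps: int = 16,
                      min_steps: int = 4,
                      decay: int = 2) -> list[int]:
    """Toy PSB-style schedule: the first block gets the most denoising
    steps, and each subsequent block gets `decay` fewer, floored at
    `min_steps`. Early frames are thus denoised more thoroughly, which
    is the stated motivation for stabilizing autoregressive rollout."""
    steps = []
    s = first_steps
    for _ in range(num_blocks):
        steps.append(max(s, min_steps))
        s -= decay
    return steps


# e.g. psb_step_schedule(8) -> [16, 14, 12, 10, 8, 6, 4, 4]
```

A linear decay is only one possible choice; any monotonically non-increasing allocation captures the same intent of bootstrapping quality from well-denoised initial frames.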