Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

πŸ“… 2025-12-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing diffusion models struggle with real-time, infinite-length audio-driven virtual avatar generation due to autoregressive computation and poor long-range temporal consistency. To address this, we propose Timestep-forcing Pipeline Parallelism (TPP) and the Rolling Sink Frame Mechanism (RSFM), establishing the first causal streaming generation framework tailored for large-scale diffusion models. Our approach integrates distributed inference, dynamic caching and calibration of reference frames, and self-forced distribution-matching distillation. Implemented on a 14B-parameter model, it enables low-latency streaming synthesis. On a 5Γ—H800 GPU cluster, our system achieves end-to-end real-time inference at 20 FPSβ€”marking the first demonstration of high-fidelity, infinitely extendable, audio-driven virtual avatar streaming. This breakthrough significantly alleviates long-video generation bottlenecks in both temporal coherence and computational efficiency, attaining industrial-grade deployment readiness.

πŸ“ Abstract
Existing diffusion-based video generation methods are fundamentally constrained by sequential computation and long-horizon inconsistency, limiting their practical adoption in real-time, streaming audio-driven avatar synthesis. We present Live Avatar, an algorithm-system co-designed framework that enables efficient, high-fidelity, and infinite-length avatar generation using a 14-billion-parameter diffusion model. Our approach introduces Timestep-forcing Pipeline Parallelism (TPP), a distributed inference paradigm that pipelines denoising steps across multiple GPUs, effectively breaking the autoregressive bottleneck and ensuring stable, low-latency real-time streaming. To further enhance temporal consistency and mitigate identity drift and color artifacts, we propose the Rolling Sink Frame Mechanism (RSFM), which maintains sequence fidelity by dynamically recalibrating appearance using a cached reference image. Additionally, we leverage Self-Forcing Distribution Matching Distillation to facilitate causal, streamable adaptation of large-scale models without sacrificing visual quality. Live Avatar demonstrates state-of-the-art performance, reaching 20 FPS end-to-end generation on 5 H800 GPUs, and, to the best of our knowledge, is the first to achieve practical, real-time, high-fidelity avatar generation at this scale. Our work establishes a new paradigm for deploying advanced diffusion models in industrial long-form video synthesis applications.
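The pipelining idea behind TPP can be illustrated with a toy simulation: each denoising step becomes a pipeline stage (in the real system, one GPU per stage), so after an initial fill latency of one tick per stage, a fully denoised frame exits every tick instead of every `NUM_STEPS` ticks. Everything concrete below (four steps, list-valued "frames", tick counting) is an illustrative assumption, not the paper's implementation.

```python
NUM_STEPS = 4  # hypothetical number of denoising steps / pipeline stages


def denoise_step(step, frame):
    """Stand-in for one diffusion denoising step (here: just tag the frame)."""
    return frame + [step]


def pipelined_generate(num_frames):
    """Push frames through the stage pipeline one tick at a time.

    After the pipeline fills (NUM_STEPS ticks of latency), one fully
    denoised frame exits per tick; this per-tick throughput is what
    breaks the sequential per-frame bottleneck.
    """
    stages = [None] * NUM_STEPS  # frame currently held by each stage
    finished, ticks, next_frame = [], 0, 0
    while len(finished) < num_frames:
        # advance the deepest stage first so frames shift down the pipe
        for s in reversed(range(NUM_STEPS)):
            if stages[s] is None:
                continue
            frame = denoise_step(s, stages[s])
            stages[s] = None
            if s + 1 == NUM_STEPS:
                finished.append(frame)  # frame has passed all steps
            else:
                stages[s + 1] = frame
        if next_frame < num_frames:
            stages[0] = [next_frame]  # inject a new noisy frame each tick
            next_frame += 1
        ticks += 1
    return finished, ticks


frames, ticks = pipelined_generate(8)
# sequential execution would cost 8 * 4 = 32 step-times;
# the pipeline finishes in 8 + 4 = 12 ticks
print(len(frames), ticks)
```

The 20 FPS figure on 5 GPUs is consistent with this shape of parallelism: steady-state throughput is set by the slowest stage, while the number of stages only adds a fixed startup latency.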
Problem

Research questions and friction points this paper is trying to address.

Real-time streaming avatar generation with infinite length
Overcoming sequential computation and long-horizon inconsistency
Achieving high-fidelity, low-latency avatar synthesis at scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

TPP pipelines denoising steps across multiple GPUs
RSFM maintains sequence fidelity with cached reference image
Self-Forcing Distribution Matching enables streamable model adaptation
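One plausible reading of RSFM's appearance recalibration is per-channel statistics matching against the cached sink frame: each generated frame is pulled back toward the reference's color statistics so drift cannot accumulate over an unbounded stream. The sketch below illustrates only that idea; the function names and the mean/std-matching rule are assumptions for illustration, not the paper's actual mechanism.

```python
import math


def channel_stats(pixels):
    """Mean and standard deviation of one color channel."""
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return mean, math.sqrt(var)


def recalibrate(generated, sink):
    """Shift/scale a generated channel so its statistics match the sink frame's.

    Because every frame is recalibrated against the same cached reference,
    small per-frame color errors cannot compound into visible drift.
    """
    g_mean, g_std = channel_stats(generated)
    s_mean, s_std = channel_stats(sink)
    scale = s_std / g_std if g_std > 0 else 1.0
    return [(p - g_mean) * scale + s_mean for p in generated]


sink = [100.0, 120.0, 140.0, 160.0]     # cached sink-frame channel
drifted = [130.0, 150.0, 170.0, 190.0]  # same content, drifted brighter
fixed = recalibrate(drifted, sink)
print(fixed)  # brightness pulled back to the reference: [100.0, 120.0, 140.0, 160.0]
```

In the actual system the "rolling" aspect would also update which frame serves as the sink over time; the key property shown here is that calibration is against a cached anchor rather than the previous frame.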
πŸ”Ž Similar Papers
No similar papers found.