StreamAvatar: Streaming Diffusion Models for Real-Time Interactive Human Avatars

📅 2025-12-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key challenges in real-time streaming interactive digital human generation, including the non-causal nature, high latency, and limited body-motion control of diffusion models, this paper proposes an end-to-end two-stage framework of autoregressive distillation and adversarial refinement. It introduces a Reference Sink mechanism, Reference-Anchored Positional Re-encoding (RAPR), and a consistency-aware discriminator to significantly enhance long-horizon generation stability and inter-frame motion coherence. Built upon a video diffusion model, the approach integrates autoregressive adaptation, knowledge distillation, and conditional adversarial training, enabling, for the first time, full-body co-generative synthesis (natural speaking, listening, and gesturing) from a single reference sample. The system achieves real-time inference at over 30 FPS. Quantitative and qualitative evaluations demonstrate state-of-the-art performance across generation quality, interaction naturalness, and end-to-end latency.

📝 Abstract
Real-time, streaming interactive avatars represent a critical yet challenging goal in digital human research. Although diffusion-based human avatar generation methods achieve remarkable success, their non-causal architecture and high computational costs make them unsuitable for streaming. Moreover, existing interactive approaches are typically limited to the head-and-shoulder region, limiting their ability to produce gestures and body motions. To address these challenges, we propose a two-stage autoregressive adaptation and acceleration framework that applies autoregressive distillation and adversarial refinement to adapt a high-fidelity human video diffusion model for real-time, interactive streaming. To ensure long-term stability and consistency, we introduce three key components: a Reference Sink, a Reference-Anchored Positional Re-encoding (RAPR) strategy, and a Consistency-Aware Discriminator. Building on this framework, we develop a one-shot, interactive human avatar model capable of generating both natural talking and listening behaviors with coherent gestures. Extensive experiments demonstrate that our method achieves state-of-the-art performance, surpassing existing approaches in generation quality, real-time efficiency, and interaction naturalness. Project page: https://streamavatar.github.io
Problem

Research questions and friction points this paper is trying to address.

Adapt diffusion models for real-time interactive avatars
Enable full-body gestures beyond head-and-shoulder limitations
Ensure long-term stability and consistency in streaming
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive distillation for real-time streaming
Reference-anchored encoding ensures long-term consistency
One-shot interactive avatar with talking and listening gestures
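The Reference Sink and RAPR ideas above can be pictured with a toy sliding-window cache. This is a hypothetical sketch, not the paper's implementation: the actual mechanism presumably operates on transformer key/value caches, whereas here plain frame tokens stand in for cached states. The point illustrated is that the reference is always kept attendable ("sink") and positions are re-encoded relative to it, so positional indices stay bounded however long the stream runs.

```python
from collections import deque

class ReferenceSinkCache:
    """Toy sliding-window context with a pinned reference block.

    Illustrative only: class and method names are invented for this sketch.
    """

    def __init__(self, window: int):
        self.window = window           # number of recent frames retained
        self.reference = None          # pinned reference frame (the "sink")
        self.frames = deque(maxlen=window)

    def set_reference(self, ref):
        self.reference = ref

    def append(self, frame):
        # Frames older than the window are evicted; the reference never is.
        self.frames.append(frame)

    def attention_context(self):
        """Return (position, token) pairs the next step may attend to.

        The reference is re-encoded at position 0 and the surviving window
        frames get contiguous positions 1..window, regardless of how many
        frames have been generated in absolute terms.
        """
        ctx = [(0, self.reference)]
        ctx += [(i + 1, f) for i, f in enumerate(self.frames)]
        return ctx

cache = ReferenceSinkCache(window=3)
cache.set_reference("ref")
for t in range(100):
    cache.append(f"frame{t}")
print(cache.attention_context())
# positions remain 0..3 even after 100 generated frames
```

Keeping positional indices anchored to the reference is one plausible reading of how such a scheme avoids positional drift during long-horizon autoregressive rollout.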