LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation

📅 2025-12-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address high latency, visual artifacts (e.g., flickering, black frames), and quality degradation in real-time interactive multimodal (text/image/speech) video generation, this paper proposes an improved on-policy distillation framework. Methodologically, it combines multimodal conditional modeling with a careful treatment of conditional input fidelity, policy initialization, and the on-policy optimization schedule, and pairs the distilled model with an Anchor-Heavy Identity Sinks mechanism for long-video inference and audio language models at the system level. It is the first work to systematically eliminate visual artifacts in multimodal video diffusion while preserving both high fidelity and low latency. Experiments show the distilled model matches the visual quality of full-step bidirectional diffusion baselines on HDTF, AVSpeech, and CelebV-HQ while reducing inference cost by 20×. Deployed in the LiveTalk system, it achieves sub-second end-to-end response latency, with better multi-turn interaction coherence and content quality than Sora2 and Veo3.

📝 Abstract
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. However, the simultaneous denoising of all video frames with bidirectional attention via an iterative process in diffusion models prevents real-time interaction. While existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate this, they focus primarily on text-to-video generation, leaving human-AI interaction unnatural and inefficient. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge this gap. Observing that Self Forcing, the leading on-policy distillation approach, encounters challenges with multimodal conditioning (visual artifacts such as flickering, black frames, and quality degradation), we investigate an improved distillation recipe that emphasizes the quality of condition inputs as well as the initialization and schedule of the on-policy optimization. On benchmarks for multimodal-conditioned (audio, image, and text) avatar video generation, including HDTF, AVSpeech, and CelebV-HQ, our distilled model matches the visual quality of full-step, bidirectional baselines of similar or larger size at 20× lower inference cost and latency. Further, we integrate our model with audio language models and the long-form video inference technique Anchor-Heavy Identity Sinks to build LiveTalk, a real-time multimodal interactive avatar system. System-level evaluation on our curated multi-turn interaction benchmark shows LiveTalk outperforms state-of-the-art models (Sora2, Veo3) in multi-turn video coherence and content quality, while reducing response latency from 1–2 minutes to real time, enabling seamless human-AI multimodal interaction.
Problem

Research questions and friction points this paper is trying to address.

Real-time video generation for multimodal interactive AI systems
Overcoming visual artifacts in on-policy distillation for multimodal conditioning
Reducing inference latency from minutes to real-time for human-AI interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Improved on-policy distillation for real-time video diffusion
Multimodal conditioning with text, image, and audio inputs
Integration with audio language models for interactive avatars
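The core distillation idea behind these contributions can be caricatured with a toy numerical sketch, not the paper's actual method: a few-step student is trained to match a many-step teacher denoiser, and supervision is computed on the student's own inputs rather than on ground-truth data (the on-policy ingredient). Here `teacher_denoise`, the linear one-parameter student, and all constants are illustrative assumptions; the real models are large video diffusion networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_denoise(x, steps=50):
    # Toy many-step iterative denoiser: each step shrinks the sample slightly,
    # standing in for an expensive full-step bidirectional diffusion teacher.
    for _ in range(steps):
        x = x - 0.05 * x
    return x

# Toy one-step student: a single scalar w, initialized to the identity map.
w = 1.0
lr = 0.5

for _ in range(200):
    noise = rng.normal(size=(32, 8))   # batch of noisy "frames"
    student_out = w * noise            # on-policy: loss is taken on the
    target = teacher_denoise(noise)    # student's own inputs, scored by teacher
    # Gradient of the MSE distillation loss w.r.t. w.
    grad = np.mean(2.0 * (student_out - target) * noise)
    w -= lr * grad

# One student step now approximates 50 teacher steps: w ≈ 0.95**50 ≈ 0.077.
print(round(w, 3))
```

The same structure, with a multi-step autoregressive student rolled out on its own past frames, is the shape of Self Forcing-style on-policy distillation; the paper's recipe additionally tunes the condition inputs, initialization, and schedule of this optimization.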