SoulX-LiveTalk Technical Report

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-time, infinite-duration audio-driven digital human generation confronts a fundamental trade-off between the high computational demands of large-scale diffusion models and stringent millisecond-level latency constraints. Existing approaches often compromise bidirectional attention or visual fidelity to improve efficiency. Method: We propose a self-correcting bidirectional distillation strategy and a multi-step retrospective self-correction mechanism to preserve full intra-video-block bidirectional modeling while ensuring long-sequence generation stability. Coupled with holistic inference acceleration—including hybrid sequence parallelism, parallel VAE decoding, and kernel-level optimizations—we achieve end-to-end real-time performance. Contribution/Results: Our system, built upon a 14B-parameter diffusion model, achieves a 0.87-second cold-start latency and 32 FPS end-to-end throughput. To our knowledge, it is the first deployment framework enabling real-time, high-fidelity, interactive, and infinitely extensible audio-to-video synthesis via diffusion models.
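The "intra-video-block bidirectional modeling" described above can be pictured as an attention mask that is bidirectional within each video chunk but causal across chunks. The sketch below is an illustration of that masking pattern only, not code from the report; the chunk size and helper name are invented for the example.

```python
def block_bidirectional_mask(num_frames, chunk):
    """True where frame i may attend to frame j: frames inside the same
    chunk attend to each other bidirectionally, while across chunks the
    mask is causal (later chunks see earlier ones, not vice versa)."""
    return [[(j // chunk) <= (i // chunk) for j in range(num_frames)]
            for i in range(num_frames)]

mask = block_bidirectional_mask(6, chunk=3)
# Frames 0-2 form one chunk and attend to each other;
# frames 3-5 attend to frames 0-5, but 0-2 cannot see 3-5.
```

Unidirectional baselines would instead force `j <= i` frame by frame, which is what the report argues sacrifices spatiotemporal correlation within a chunk.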

📝 Abstract
Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce SoulX-LiveTalk, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a Self-correcting Bidirectional Distillation strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a Multi-step Retrospective Self-Correction Mechanism, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-LiveTalk is the first 14B-scale system to achieve a sub-second start-up latency (0.87s) while reaching a real-time throughput of 32 FPS, setting a new standard for high-fidelity interactive digital human synthesis.
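The Multi-step Retrospective Self-Correction Mechanism named in the abstract can be sketched as a streaming loop in which each denoising step also revisits recently emitted chunks and may overwrite them, repairing drift before it accumulates. This is a minimal illustrative sketch, not the report's implementation; `denoise_chunk`, `retrospect`, and the toy denoiser below are all invented for the example.

```python
def generate_stream(denoise_chunk, audio_chunks, retrospect=2):
    """Hypothetical sketch: for each incoming audio chunk, the denoiser
    receives the last `retrospect` emitted video chunks as context and
    may return corrected versions of them, which replace the originals
    before the new chunk is appended. `denoise_chunk` stands in for the
    distilled few-step denoiser."""
    history = []
    for audio in audio_chunks:
        context = history[-retrospect:]
        new_chunk, corrected = denoise_chunk(audio, context)
        if corrected:                          # overwrite the revisited chunks
            history[-len(corrected):] = corrected
        history.append(new_chunk)
    return history

# Toy denoiser: echoes the audio and passes context through unchanged.
identity = lambda audio, ctx: (audio, list(ctx))
print(generate_stream(identity, [1, 2, 3]))  # [1, 2, 3]
```

The point of the pattern is that generation stays streaming (chunks are emitted as they are produced) while still giving the model a bounded window in which to undo accumulated error, which is what prevents the long-horizon collapse the abstract mentions.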
Problem

Research questions and friction points this paper is trying to address.

Achieving real-time, high-fidelity audio-driven avatar generation
Resolving the conflict between computational load and strict latency constraints
Ensuring stable infinite-duration generation without visual collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-correcting Bidirectional Distillation for spatiotemporal coherence
Multi-step Retrospective Self-Correction Mechanism for stability
Full-stack inference acceleration suite for real-time performance
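The real-time targets quoted in the summary imply a concrete compute budget, which the arithmetic below makes explicit. The 32 FPS figure is from the report; the 16-frame chunk size is an assumption made only to illustrate the per-chunk budget.

```python
fps = 32                                   # reported end-to-end throughput
frame_budget_ms = 1000 / fps               # 31.25 ms per frame, end to end
chunk_frames = 16                          # hypothetical chunk size (assumption)
chunk_budget_ms = chunk_frames * frame_budget_ms

print(frame_budget_ms)   # 31.25
print(chunk_budget_ms)   # 500.0
```

Every stage of the pipeline (denoising, VAE decoding, post-processing) must fit inside that budget together, which is why the report pairs the distilled sampler with hybrid sequence parallelism, parallel VAE decoding, and kernel-level optimizations rather than relying on any single speedup.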
Authors
Le Shen (AIGC Team, Soul AI Lab, China; Donghua University)
Qiao Qian (AIGC Team, Soul AI Lab, China)
Tan Yu (NVIDIA)
Ke Zhou (AIGC Team, Soul AI Lab, China)
Tianhang Yu (AIGC Team, Soul AI Lab, China)
Yu Zhan (Southern University of Science and Technology)
Zhenjie Wang (Donghua University)
Ming Tao (AIGC Team, Soul AI Lab, China)
Shunshun Yin (AIGC Team, Soul AI Lab, China)
Siyuan Liu (AIGC Team, Soul AI Lab, China)