SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads

📅 2026-02-07
🤖 AI Summary
Existing audio-driven talking-head generation methods struggle to achieve high fidelity, temporal stability, and real-time streaming at the same time. This work proposes SoulX-FlashHead, a unified 1.3B-parameter framework for infinite-duration, high-fidelity, low-latency talking-head video synthesis. Its key contributions are streaming-aware spatiotemporal pre-training, a temporal audio context cache that makes feature extraction from short audio fragments more robust, and an oracle-guided bidirectional distillation strategy designed to mitigate error accumulation and identity drift in autoregressive generation. The authors report state-of-the-art performance on the HDTF and VFHQ benchmarks. The Lite variant reaches 96 FPS inference on a single NVIDIA RTX 4090 GPU, which the authors describe as enabling ultra-fast interaction without sacrificing visual coherence.

📝 Abstract
Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.
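
The abstract does not spell out how the Temporal Audio Context Cache works, so the following is only a minimal sketch of the general idea of giving a short streaming audio fragment some preceding context before feature extraction. The class name `AudioContextCache`, the `context_samples` parameter, and the rolling-buffer design are all assumptions made for illustration, not the authors' implementation.

```python
from collections import deque
from typing import Deque, List, Sequence


class AudioContextCache:
    """Hypothetical rolling buffer that prepends recent audio to each new chunk."""

    def __init__(self, context_samples: int) -> None:
        if context_samples < 0:
            raise ValueError("context_samples must be non-negative")
        # A bounded deque drops the oldest samples automatically once it is full.
        self._history: Deque[float] = deque(maxlen=context_samples)

    def extend(self, chunk: Sequence[float]) -> List[float]:
        """Return the cached context followed by the new chunk, then update the cache."""
        window = list(self._history) + list(chunk)
        self._history.extend(chunk)
        return window


cache = AudioContextCache(context_samples=4)
print(cache.extend([1.0, 2.0, 3.0]))  # [1.0, 2.0, 3.0] (no context yet)
print(cache.extend([4.0, 5.0, 6.0]))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(cache.extend([7.0]))            # [3.0, 4.0, 5.0, 6.0, 7.0]
```

In a sketch like this, the audio encoder would receive the returned window instead of the bare fragment, so even a very short chunk is processed together with the most recent samples that came before it.
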
Problem

Research questions and friction points this paper is trying to address.

audio-driven talking head
real-time streaming
temporal stability
identity drift
low-latency generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming-Aware Spatiotemporal Pre-training
Temporal Audio Context Cache
Oracle-Guided Bidirectional Distillation
Real-time Talking Head Generation
VividHead Dataset
Authors
Tan Yu
NVIDIA
LLM, RAG, Cross-modal search, advertising, vision backbone
Qian Qiao
AIGC Team, Soul AI Lab, China
Le Shen
AIGC Team, Soul AI Lab, China
Ke Zhou
AIGC Team, Soul AI Lab, China
Jincheng Hu
AIGC Team, Soul AI Lab, China
Dian Sheng
AIGC Team, Soul AI Lab, China
Bo Hu
AIGC Team, Soul AI Lab, China
Haoming Qin
AIGC Team, Soul AI Lab, China
Jun Gao
AIGC Team, Soul AI Lab, China
Changhai Zhou
AIGC Team, Soul AI Lab, China
Shunshun Yin
AIGC Team, Soul AI Lab, China
Siyuan Liu
AIGC Team, Soul AI Lab, China