Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching

📅 2025-07-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address inaccurate lip synchronization and long-term pose drift in real-time audio-driven talking-head video generation, this paper proposes a real-time-optimized, tailored flow matching framework. Methodologically, it integrates audio-feature conditioning, explicit pose modeling, and efficient inference optimization to enable end-to-end low-latency sequence generation. Its key contribution is a lightweight flow matching architecture that significantly improves visual naturalness and temporal stability while preserving high frame-level temporal coherence. Experiments on the HDTF dataset demonstrate a LipSync Confidence score of 8.50, an inference throughput of 141 FPS on a single A10 GPU, and an end-to-end latency of only 0.17 seconds, enabling high-fidelity virtual avatar deployment across diverse real-time scenarios.
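For readers unfamiliar with flow matching, the training objective behind such frameworks can be sketched as a simple velocity-regression loss. This is a generic, minimal illustration only: the `velocity_fn` interface, the audio-conditioning argument `cond`, and the straight-line probability path are assumptions for exposition, not Livatar's actual architecture or conditioning scheme.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """One evaluation of a conditional flow matching objective.

    velocity_fn(xt, t, cond) -> predicted velocity (hypothetical interface);
    x1: target data (e.g. video-frame latents); cond: conditioning features
    (e.g. audio embeddings). Names are illustrative, not from the paper.
    """
    x0 = rng.standard_normal(x1.shape)        # sample from the Gaussian source
    t = rng.uniform(size=(x1.shape[0], 1))    # random time in [0, 1) per sample
    xt = (1 - t) * x0 + t * x1                # straight-line path between x0 and x1
    v_target = x1 - x0                        # constant velocity along that path
    v_pred = velocity_fn(xt, t, cond)
    return np.mean((v_pred - v_target) ** 2)  # regress predicted onto target velocity
```

A trained velocity field can then generate samples by integrating an ODE from noise toward data; with near-straight paths, very few integration steps suffice, which is what makes flow matching attractive for real-time generation.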

๐Ÿ“ Abstract
We present Livatar, a real-time audio-driven talking-heads video generation framework. Existing baselines suffer from limited lip-sync accuracy and long-term pose drift. We address these limitations with a flow matching based framework. Coupled with system optimizations, Livatar achieves competitive lip-sync quality with an 8.50 LipSync Confidence on the HDTF dataset, and reaches a throughput of 141 FPS with an end-to-end latency of 0.17s on a single A10 GPU. This makes high-fidelity avatars accessible to broader applications. Our project is available at https://www.hedra.com/ with examples at https://h-liu1997.github.io/Livatar-1/
Problem

Research questions and friction points this paper is trying to address.

Improving lip-sync accuracy in talking heads generation
Reducing long-term pose drift in video synthesis
Enabling real-time high-fidelity avatar generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow matching for real-time talking heads
Optimized system achieves 141 FPS
High lip-sync accuracy with 8.50 score
Haiyang Liu (The University of Tokyo)
Xiaolin Hong (Hedra Inc.)
Xuancheng Yang (Hedra Inc.)
Yudi Ruan (Hedra Inc.)
Xiang Lian (Hedra Inc.)
Michael Lingelbach (Hedra Inc.)
Hongwei Yi (Max Planck Institute for Intelligent Systems)
Wei Li (Hedra Inc.)