Audio-Visual Driven Compression for Low-Bitrate Talking Head Videos

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address challenges in talking head video compression at low bitrates—including difficulty modeling large-angle head motions, severe audio-visual desynchronization (especially of lip movement), and facial reconstruction artifacts—this paper proposes the first audio-visual co-driven neural codec framework. Methodologically, it introduces a neural rendering-based 3D keypoint motion representation, designs a cross-modal audio-visual feature fusion mechanism, and enables end-to-end differentiable encoding and decoding. Its key innovation lies in jointly driving head dynamics with compact 3D motion features and raw audio signals, significantly improving robustness to large rotations and lip motion alignment accuracy. Evaluated on CelebV-HQ, the method achieves a 22% bitrate reduction over VVC and an 8.5% reduction over state-of-the-art learned codecs, while improving lip synchronization error (LSE) by 31% and mean opinion score (MOS) by 0.8.

📝 Abstract
Talking head video compression has advanced with neural rendering and keypoint-based methods, but challenges remain, especially at low bitrates, including handling large head movements, suboptimal lip synchronization, and distorted facial reconstructions. To address these problems, we propose a novel audio-visual driven video codec that integrates compact 3D motion features and audio signals. This approach robustly models significant head rotations and aligns lip movements with speech, improving both compression efficiency and reconstruction quality. Experiments on the CelebV-HQ dataset show that our method reduces bitrate by 22% compared to VVC and by 8.5% over state-of-the-art learning-based codecs. Furthermore, it provides superior lip-sync accuracy and visual fidelity at comparable bitrates, highlighting its effectiveness in bandwidth-constrained scenarios.
Problem

Research questions and friction points this paper is trying to address.

Handling large head movements in low-bitrate videos
Improving lip synchronization with speech audio
Reducing facial distortions in compressed videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates 3D motion features and audio signals
Improves compression efficiency and reconstruction quality
Reduces bitrate significantly compared to VVC
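The core idea listed above—jointly driving reconstruction from compact 3D keypoint motion and audio—can be sketched at a very high level. The snippet below is a hypothetical illustration only: the paper's actual fusion network, feature dimensions, and function names are not specified here, and a real system would use learned neural encoders rather than simple concatenation.

```python
# Hypothetical sketch of audio-visual feature fusion for keypoint-driven
# talking-head reconstruction. Illustrative only; not the paper's
# actual architecture, dimensions, or training setup.

def fuse_features(keypoint_deltas, audio_feature):
    """Form a joint driving vector from compact 3D keypoint motion
    deltas and an audio embedding.

    keypoint_deltas: list of (dx, dy, dz) tuples, one per 3D keypoint
    audio_feature:   flat audio embedding (e.g., from a speech encoder)
    """
    # Flatten N x 3 motion deltas into a single 3N vector
    flat_motion = [c for kp in keypoint_deltas for c in kp]
    # Concatenate modalities; a learned fusion module would go here
    return flat_motion + list(audio_feature)

# Toy example: 3 keypoints with (dx, dy, dz) motion, 4-dim audio embedding
deltas = [(0.1, -0.2, 0.0), (0.0, 0.3, 0.1), (-0.1, 0.0, 0.2)]
audio = [0.5, 0.1, -0.3, 0.2]
driving = fuse_features(deltas, audio)
print(len(driving))  # 3*3 + 4 = 13
```

At the decoder, such a driving vector would condition a neural renderer to warp a reference frame, which is how keypoint-based codecs achieve very low bitrates: only the compact motion and audio features are transmitted, not pixels.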
Riku Takahashi
Hosei University, Tokyo, Japan
Ryugo Morita
Hosei University
AI · Computer Vision · Image/Video Generation · GANs · Diffusion Models
Jinjia Zhou
Hosei University, Tokyo, Japan