Audio-Assisted Face Video Restoration with Temporal and Identity Complementary Learning

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing facial video restoration methods largely neglect cross-modal correlations between audio and video, particularly around the mouth region, and most audio-assisted approaches target only compression artifacts, failing to address the multiple degradations (e.g., compression, motion blur, and low resolution) common in streaming scenarios. To address this, the authors propose the Generalized Audio-assisted Video Restoration Network (GAVN), a framework for jointly restoring videos degraded by multiple distortions. GAVN introduces a cross-modal complementary learning mechanism over temporal and identity features: it models inter-frame temporal dynamics in the low-resolution space, fuses audio semantics with landmark-guided identity features in the high-resolution space, and reconstructs fine details with a multimodal fusion module. Experiments show that GAVN outperforms state-of-the-art methods on compression artifact removal, deblurring, and super-resolution, with high efficiency and robustness across diverse degradation types.

📝 Abstract
Face videos accompanied by audio have become integral to our daily lives, yet they often suffer from complex degradations. Most face video restoration methods neglect the intrinsic correlations between visual and audio features, especially in mouth regions. A few audio-aided face video restoration methods have been proposed, but they focus only on compression artifact removal. In this paper, we propose a General Audio-assisted face Video restoration Network (GAVN) to address various types of streaming video distortions via identity and temporal complementary learning. Specifically, GAVN first captures inter-frame temporal features in the low-resolution space to restore frames coarsely and save computational cost. Then, GAVN extracts intra-frame identity features in the high-resolution space with the assistance of audio signals and face landmarks to restore more facial details. Finally, the reconstruction module integrates temporal features and identity features to generate high-quality face videos. Experimental results demonstrate that GAVN outperforms the existing state-of-the-art methods on face video compression artifact removal, deblurring, and super-resolution. Codes will be released upon publication.
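The paper's code is not yet released, so as a rough illustration only, the three-stage pipeline the abstract describes (coarse temporal restoration in low resolution, audio- and landmark-guided identity refinement in high resolution, then fusion) could be sketched as below. All module names, layer choices, and dimensions here are hypothetical placeholders, not the actual GAVN architecture:

```python
import torch
import torch.nn as nn

class GAVNSketch(nn.Module):
    """Hypothetical sketch of the three-stage GAVN pipeline.

    Layer choices and dimensions are illustrative assumptions; the paper's
    real architecture has not been published.
    """

    def __init__(self, ch=32, audio_dim=128, landmark_dim=68 * 2):
        super().__init__()
        # Stage 1: inter-frame temporal features in the low-resolution space
        # (3D conv over a short frame window for coarse restoration).
        self.temporal = nn.Conv3d(3, ch, kernel_size=3, padding=1)
        # Stage 2: intra-frame identity features in the high-resolution space,
        # conditioned on audio and face-landmark embeddings.
        self.audio_proj = nn.Linear(audio_dim, ch)
        self.landmark_proj = nn.Linear(landmark_dim, ch)
        self.identity = nn.Conv2d(3 + ch, ch, kernel_size=3, padding=1)
        # Stage 3: reconstruction module fusing both feature streams.
        self.fuse = nn.Conv2d(2 * ch, 3, kernel_size=3, padding=1)

    def forward(self, frames_lr, frame_hr, audio_feat, landmarks):
        # frames_lr: (B, 3, T, h, w) low-res window around the current frame
        # frame_hr:  (B, 3, H, W)    upscaled/degraded current frame
        t = self.temporal(frames_lr).mean(dim=2)  # aggregate over time
        t = nn.functional.interpolate(t, size=frame_hr.shape[-2:])
        # Broadcast audio + landmark conditioning over the spatial grid.
        cond = self.audio_proj(audio_feat) + self.landmark_proj(landmarks)
        cond = cond[..., None, None].expand(-1, -1, *frame_hr.shape[-2:])
        i = self.identity(torch.cat([frame_hr, cond], dim=1))
        return self.fuse(torch.cat([t, i], dim=1))  # restored HR frame
```

The split mirrors the stated efficiency argument: temporal aggregation is the expensive cross-frame step, so it runs at low resolution, while the per-frame identity branch works at full resolution where audio and landmark cues can sharpen mouth-region detail.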
Problem

Research questions and friction points this paper is trying to address.

Existing face video restoration methods ignore audio-visual correlations, especially around the mouth
Prior audio-aided methods handle only compression artifacts, not blur or low resolution
Streaming face videos suffer multiple simultaneous degradations that single-task models fail to cover
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses audio signals and face landmarks to recover facial details
Combines inter-frame temporal and intra-frame identity complementary learning
Processes temporal features in low resolution for efficiency and identity features in high resolution for detail
Yuqin Cao, Shanghai Jiao Tong University
Yixuan Gao, Shanghai Jiao Tong University
Wei Sun, East China Normal University
Xiaohong Liu, Shanghai Jiao Tong University
Yulun Zhang, Shanghai Jiao Tong University
Xiongkuo Min, Shanghai Jiao Tong University