🤖 AI Summary
Existing face video restoration methods largely neglect cross-modal correlations between audio and video, particularly around the mouth region, and the few audio-assisted approaches target only compression artifacts, leaving other degradations common in streaming scenarios (e.g., blur and low resolution) unaddressed. To address this, the authors propose the General Audio-assisted face Video restoration Network (GAVN), which handles multiple distortion types through identity and temporal complementary learning. GAVN first models inter-frame temporal features in the low-resolution space to restore frames coarsely at reduced computational cost, then extracts intra-frame identity features in the high-resolution space with the help of audio signals and face landmarks, and finally integrates the temporal and identity features in a reconstruction module to recover facial detail. Experiments show that GAVN outperforms state-of-the-art methods on face video compression artifact removal, deblurring, and super-resolution.
📝 Abstract
Face videos accompanied by audio have become integral to daily life, but they often suffer from complex degradations. Most face video restoration methods neglect the intrinsic correlations between visual and audio features, especially in mouth regions. A few audio-aided face video restoration methods have been proposed, but they focus only on compression artifact removal. In this paper, we propose a General Audio-assisted face Video restoration Network (GAVN) to address various types of streaming video distortions via identity and temporal complementary learning. Specifically, GAVN first captures inter-frame temporal features in the low-resolution space to restore frames coarsely and save computational cost. Then, GAVN extracts intra-frame identity features in the high-resolution space with the assistance of audio signals and face landmarks to restore more facial details. Finally, the reconstruction module integrates temporal features and identity features to generate high-quality face videos. Experimental results demonstrate that GAVN outperforms the existing state-of-the-art methods on face video compression artifact removal, deblurring, and super-resolution. Code will be released upon publication.
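The three-stage data flow described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: every function name, the neighbour-averaging temporal stage, the nearest-neighbour upsampling, the audio-driven gain, and the blending weight `alpha` are hypothetical stand-ins chosen only to show where each modality enters and at which resolution.

```python
import numpy as np

def temporal_stage(lr_frames):
    """Coarse low-resolution restoration (assumed stand-in): average each
    frame with its two temporal neighbours. lr_frames has shape (T, H, W)."""
    padded = np.pad(lr_frames, ((1, 1), (0, 0), (0, 0)), mode="edge")
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def upsample(frames, scale):
    """Nearest-neighbour upsampling to the high-resolution space (assumption)."""
    return frames.repeat(scale, axis=1).repeat(scale, axis=2)

def identity_stage(hr_frames, audio_feat, landmark_mask):
    """Intra-frame stage (assumed stand-in): a per-frame audio feature
    modulates pixels inside a landmark mask, e.g. the mouth region.
    audio_feat: (T,), landmark_mask: (T, H*scale, W*scale) with values in [0, 1]."""
    gain = 1.0 + landmark_mask * np.tanh(audio_feat)[:, None, None]
    return hr_frames * gain

def reconstruct(temporal_feat, identity_feat, alpha=0.5):
    """Fuse temporal and identity features; alpha is a hypothetical weight."""
    return alpha * temporal_feat + (1.0 - alpha) * identity_feat

def restore(lr_frames, audio_feat, landmark_mask, scale=2):
    coarse = temporal_stage(lr_frames)        # stage 1: low-resolution, inter-frame
    coarse_hr = upsample(coarse, scale)       # move to high-resolution space
    ident = identity_stage(coarse_hr, audio_feat, landmark_mask)  # stage 2
    return reconstruct(coarse_hr, ident)      # stage 3: fusion
```

In this sketch the expensive temporal processing runs on the smaller low-resolution frames, which mirrors the computational-cost argument in the abstract; a real model would replace each stand-in with learned network modules.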