DicFace: Dirichlet-Constrained Variational Codebook Learning for Temporally Coherent Video Face Restoration

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video face restoration faces a fundamental trade-off between temporal consistency and fine-grained detail recovery. To address this, we propose the first framework that extends VQ-VAE-based image priors to video face restoration. Its core innovation lies in reformulating the discrete codebook as continuous latent variables modeled by Dirichlet distributions, enabling probabilistic, smooth inter-frame transitions of facial features and substantially suppressing flickering artifacts. The method integrates variational latent-space modeling, a spatio-temporal Transformer architecture, a Laplacian-pyramid reconstruction loss, and LPIPS-based perceptual regularization. Evaluated on three challenging tasks—blind restoration, video inpainting, and face colorization—the approach achieves state-of-the-art performance. Quantitative and qualitative results show significant improvements in both temporal stability (reduced flicker and jitter) and visual fidelity (sharper textures, more natural appearance), setting a new bar for coherent, high-fidelity video face restoration.
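The core idea — replacing hard codebook lookup with a Dirichlet-weighted convex combination of codes — can be illustrated with a toy NumPy sketch. All sizes, the codebook, and both helper functions below are made-up stand-ins for illustration, not the paper's implementation:

```python
import numpy as np

# Hypothetical sizes; the paper's actual codebook dimensions are not given here.
K, d = 8, 4                          # number of codebook entries, code dimension
rng = np.random.default_rng(0)
codebook = rng.normal(size=(K, d))   # stand-in for pretrained VQ-VAE code vectors

def hard_quantize(z):
    """Standard VQ: snap the latent to its nearest code.
    Discrete jumps between codes across frames are what cause flicker."""
    idx = np.argmin(np.linalg.norm(codebook - z, axis=1))
    return codebook[idx]

def dirichlet_relax(alpha):
    """Dirichlet relaxation: draw weights w ~ Dir(alpha) (non-negative,
    summing to 1) and return a convex combination of codes -- a continuous
    latent that can shift smoothly from frame to frame."""
    w = rng.dirichlet(alpha)
    return w @ codebook

z = rng.normal(size=d)
print(hard_quantize(z))              # jumps discretely as z crosses code boundaries
print(dirichlet_relax(np.ones(K)))   # varies continuously with the weights
```

In the paper the concentration parameters would be predicted per frame by the spatio-temporal Transformer rather than fixed, so neighboring frames get nearby weight vectors and hence nearby latents.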

📝 Abstract
Video face restoration faces a critical challenge in maintaining temporal consistency while recovering fine facial details from degraded inputs. This paper presents a novel approach that extends Vector-Quantized Variational Autoencoders (VQ-VAEs), pretrained on static high-quality portraits, into a video restoration framework through variational latent space modeling. Our key innovation lies in reformulating discrete codebook representations as Dirichlet-distributed continuous variables, enabling probabilistic transitions between facial features across frames. A spatio-temporal Transformer architecture jointly models inter-frame dependencies and predicts latent distributions, while a Laplacian-constrained reconstruction loss combined with perceptual (LPIPS) regularization enhances both pixel accuracy and visual quality. Comprehensive evaluations on blind face restoration, video inpainting, and facial colorization tasks demonstrate state-of-the-art performance. This work establishes an effective paradigm for adapting image priors learned from high-quality stills to video restoration while addressing the critical challenge of flicker artifacts. The source code has been open-sourced and is available at https://github.com/fudan-generative-vision/DicFace.
Problem

Research questions and friction points this paper is trying to address.

Maintaining temporal consistency in video face restoration
Recovering fine facial details from degraded inputs
Addressing flicker artifacts in video restoration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dirichlet-distributed continuous variables for transitions
Spatio-temporal Transformer for inter-frame dependencies
Laplacian-constrained loss with LPIPS regularization
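The Laplacian-constrained loss in the last bullet can be sketched as an L1 distance summed over the levels of a Laplacian pyramid, so errors are penalized at every frequency band rather than only in raw pixels. This is a minimal sketch: the box-filter blur, nearest-neighbor upsampling, level count, and equal level weights are assumptions, not the paper's exact formulation:

```python
import numpy as np

def blur_down(img):
    """Cheap 2x downsample via a 2x2 box filter (stands in for a Gaussian blur)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Nearest-neighbor 2x upsampling back to the finer grid."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def laplacian_pyramid(img, levels=3):
    """Decompose an image into band-pass detail layers plus a low-frequency residual."""
    pyr, cur = [], img
    for _ in range(levels):
        down = blur_down(cur)
        pyr.append(cur - upsample(down))  # detail lost at this scale
        cur = down
    pyr.append(cur)                       # coarsest residual
    return pyr

def lap_loss(pred, target, levels=3):
    """Mean absolute error accumulated over all pyramid levels."""
    return sum(np.abs(p - t).mean()
               for p, t in zip(laplacian_pyramid(pred, levels),
                               laplacian_pyramid(target, levels)))
```

Sizes must be divisible by 2^levels for this toy version; a perceptual LPIPS term (a learned deep-feature distance) would be added on top in the actual training objective.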