FREAK: Frequency-modulated High-fidelity and Real-time Audio-driven Talking Portrait Synthesis

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
In audio-driven talking-head synthesis, lip-sync inaccuracies and high-frequency detail loss remain critical challenges; existing pixel-domain approaches suffer from insufficient spectral fidelity. This paper introduces, for the first time, a frequency-domain modeling paradigm. We propose a Visual Encoding Frequency Modulator (VEFM) and an Audio-Visual Frequency Modulator (AVFM) to enable precise cross-modal coupling of audio and visual features in the frequency domain. Our framework further incorporates multi-scale spectral coupling, joint pixel-frequency domain optimization, and a lightweight temporal modeling module. It supports seamless switching between single-image driving and video dubbing, enabling real-time 1080p inference. Experiments demonstrate a 12.6% improvement in lip-sync accuracy and consistent superiority over state-of-the-art methods across LSE, SyncNet, and LPIPS metrics, alongside significantly enhanced spectral fidelity.

📝 Abstract
Achieving high-fidelity lip-speech synchronization in audio-driven talking portrait synthesis remains challenging. While multi-stage pipelines and diffusion models yield high-quality results, they suffer from high computational costs. Some approaches perform well on specific individuals with low resource demands, yet still exhibit mismatched lip movements. All of the aforementioned methods model talking portraits in the pixel domain. We observe noticeable discrepancies in the frequency domain between synthesized talking videos and natural videos, an aspect that no prior work on talking portrait synthesis has considered. To address this, we propose a FREquency-modulated, high-fidelity, and real-time Audio-driven talKing portrait synthesis framework, named FREAK, which models talking portraits from a frequency-domain perspective, enhancing the fidelity and naturalness of the synthesized portraits. FREAK introduces two novel frequency-based modules: 1) the Visual Encoding Frequency Modulator (VEFM), which couples multi-scale visual features in the frequency domain, better preserving visual frequency information and narrowing the spectral gap between synthesized and natural frames; and 2) the Audio-Visual Frequency Modulator (AVFM), which helps the model learn talking patterns in the frequency domain and improves audio-visual synchronization. Additionally, we jointly optimize the model in both the pixel and frequency domains. FREAK also supports seamless switching between one-shot and video dubbing settings, offering enhanced flexibility. Owing to its efficiency, it supports high-resolution output and real-time inference simultaneously. Extensive experiments demonstrate that our method synthesizes high-fidelity talking portraits with detailed facial textures and precise lip synchronization in real time, outperforming state-of-the-art methods.
Problem

Research questions and friction points this paper is trying to address.

Achieving high-fidelity lip-speech synchronization in audio-driven talking portraits.
Reducing computational costs while maintaining high-quality synthesis results.
Addressing frequency domain discrepancies between synthesized and natural videos.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Frequency domain modeling for talking portraits
Visual Encoding Frequency Modulator (VEFM) integration
Audio Visual Frequency Modulator (AVFM) for synchronization
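The paper does not publish its loss formulation here, but the idea of joint pixel-frequency domain optimization can be sketched with a simple NumPy stand-in: an L1 term on raw pixels plus an L1 term on FFT magnitude spectra. The function name `joint_pixel_frequency_loss` and the weighting factor `lam` are illustrative assumptions, not FREAK's actual objective.

```python
import numpy as np

def joint_pixel_frequency_loss(pred, target, lam=0.1):
    """Hypothetical sketch of a joint pixel/frequency objective:
    L1 distance in the pixel domain plus L1 distance between
    FFT magnitude spectra (the 'spectral fidelity' term)."""
    pixel_loss = np.mean(np.abs(pred - target))
    pred_spec = np.abs(np.fft.fft2(pred))      # 2-D spectrum of prediction
    target_spec = np.abs(np.fft.fft2(target))  # 2-D spectrum of target
    freq_loss = np.mean(np.abs(pred_spec - target_spec))
    return pixel_loss + lam * freq_loss

# Toy 8x8 grayscale "frames" standing in for video frames.
rng = np.random.default_rng(0)
a = rng.random((8, 8))
b = rng.random((8, 8))
print(joint_pixel_frequency_loss(a, a))  # identical frames -> 0.0
print(joint_pixel_frequency_loss(a, b))  # mismatched frames -> positive
```

The frequency term penalizes exactly the spectral discrepancies the authors observed between synthesized and natural frames, which a pixel-only loss can leave under-constrained.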
Ziqi Ni
Southeast University
Computer Vision · Generative AI
Ao Fu
School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, China
Yi Zhou
School of Computer Science and Engineering, Southeast University, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, China