🤖 AI Summary
Existing diffusion-based methods for audio-driven talking head generation suffer from causal inefficiency, poor compatibility with temporally coherent conditioning, and progressive drift over long sequences, which limits their use in real-time scenarios. This work proposes AsymK-Talker, a diffusion-distillation framework with three components: Kernel-Conditioned Loop Generation (KCLG), Temporal Reference Encoding (TRE), and Asymmetric Kernel Distillation (AKD). KCLG generates video causally, chunk by chunk, using motion kernels to propagate temporal consistency; TRE converts a static identity reference into a time-aware latent representation to improve audio-visual synchronization; and AKD trains a student that conditions on generated kernels under a teacher that conditions on ground-truth kernels, making the student robust over extended generation. The design targets real-time, temporally stable, long-horizon generation, and the authors report promising results on both visual fidelity and lip-synchronization metrics.
📝 Abstract
Recent advances in diffusion models have markedly enhanced the visual fidelity of audio-driven talking head generation. Nevertheless, existing methods are constrained by three critical limitations: causal inefficiency that impedes real-time inference, incompatibility with temporally coherent conditioning, and progressive drift over long-horizon generation, collectively hindering their deployment in real-time applications. To overcome these challenges, we introduce AsymK-Talker, a novel diffusion-distillation method designed for real-time and long-horizon talking head generation. AsymK-Talker comprises three key components: (1) Kernel-Conditioned Loop Generation (KCLG), a causal, chunk-wise generation paradigm that leverages motion kernels to enable temporally consistent propagation; (2) Temporal Reference Encoding (TRE), which converts a static identity reference into a time-aware latent representation to enhance audio-visual synchronization; and (3) Asymmetric Kernel Distillation (AKD), a teacher-student distillation framework wherein the teacher model conditions on ground-truth motion kernels for supervision, while the student learns to generate from generated kernels, thereby ensuring robustness during extended generation sequences. AsymK-Talker achieves promising results on both visual fidelity and lip synchronization metrics.
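The chunk-wise loop behind KCLG and the asymmetric conditioning behind AKD can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not define how a motion kernel is computed, so the sketch treats it as the trailing frames of the previous chunk, and `toy_generator`, `generate_chunkwise`, and `select_kernel` are hypothetical names rather than the authors' code.

```python
from typing import Callable, List

Frame = List[float]  # toy stand-in for one latent video frame
Generator = Callable[[List[float], List[Frame]], List[Frame]]


def toy_generator(audio_chunk: List[float], kernel: List[Frame]) -> List[Frame]:
    """Hypothetical stand-in for the student model (NOT the paper's network).

    Each new frame blends the previous frame with the current audio value.
    """
    last = kernel[-1]
    frames: List[Frame] = []
    for a in audio_chunk:
        last = [0.5 * x + 0.5 * a for x in last]
        frames.append(last)
    return frames


def generate_chunkwise(
    audio: List[float],
    init_kernel: List[Frame],
    chunk_size: int,
    kernel_size: int,
    generator: Generator = toy_generator,
) -> List[Frame]:
    """Causal, chunk-by-chunk generation.

    Assumption: the motion kernel is the last `kernel_size` (>= 1) frames
    produced so far; the paper may define it differently.
    """
    kernel = list(init_kernel)
    output: List[Frame] = []
    for start in range(0, len(audio), chunk_size):
        chunk = audio[start:start + chunk_size]
        frames = generator(chunk, kernel)  # sees only past frames and current audio
        output.extend(frames)
        kernel = (kernel + frames)[-kernel_size:]  # carry the kernel forward
    return output


def select_kernel(
    role: str,
    ground_truth: List[Frame],
    generated: List[Frame],
    kernel_size: int,
) -> List[Frame]:
    """Asymmetric conditioning: the teacher takes its kernel from ground-truth
    frames, the student from generated frames."""
    source = ground_truth if role == "teacher" else generated
    return source[-kernel_size:]
```

Because each chunk depends only on the carried kernel and its own audio, earlier frames are unaffected by later audio, which is the property that permits streaming output. Training the student on kernels taken from generated frames exposes it to the same kind of input it will receive at inference time, which is the stated motivation for the asymmetry.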