FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation

📅 2024-12-22
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address three key challenges in audio-driven talking avatar synthesis (slow diffusion-model inference, poor robustness on open-domain images, and weak audio-video alignment), this paper proposes a multi-Classifier-Free Guidance (CFG) distillation framework that leverages hybrid supervision and learnable cross-modal condition tokens. The method introduces a hybrid supervision scheme integrating reconstruction loss, adversarial loss, and explicit audio-video synchronization constraints, and employs learnable tokens to explicitly model inter-modal conditional relationships, compressing conventional three-pass CFG inference into a single forward pass. Evaluated on multiple benchmarks, the approach achieves a 4.17–12.5× reduction in the number of function evaluations (NFEs), matches state-of-the-art diffusion models in generation quality, and significantly improves generalization to open-domain images as well as lip-sync accuracy, reducing lip-sync error (LSE) by 18.3% and increasing the SyncNet score by 2.1.

📝 Abstract
Diffusion-based audio-driven talking avatar methods have recently gained attention for their high-fidelity, vivid, and expressive results. However, their slow inference speed limits practical applications. Despite the development of various distillation techniques for diffusion models, we found that naive diffusion distillation methods do not yield satisfactory results. Distilled models exhibit reduced robustness with open-set input images and a decreased correlation between audio and video compared to teacher models, undermining the advantages of diffusion models. To address this, we propose FADA (Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation). We first designed a mixed-supervised loss to leverage data of varying quality and enhance the overall model capability as well as robustness. Additionally, we propose a multi-CFG distillation with learnable tokens to utilize the correlation between audio and reference image conditions, reducing the threefold inference runs caused by multi-CFG with acceptable quality degradation. Extensive experiments across multiple datasets show that FADA generates vivid videos comparable to recent diffusion model-based methods while achieving an NFE speedup of 4.17-12.5 times. Demos are available at our webpage http://fadavatar.github.io.
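The "threefold inference runs" in the abstract refer to multi-condition CFG: each denoising step evaluates the network unconditionally, with the audio condition only, and with audio plus reference image, then blends the three outputs. A minimal sketch of that blending, which the distilled student learns to approximate in a single pass; the function name, guidance scales, and scalar stand-ins for denoiser outputs are illustrative assumptions, not values from the paper:

```python
def multi_cfg(eps_uncond, eps_audio, eps_audio_ref, s_audio=2.0, s_ref=1.5):
    """Blend three denoiser outputs into one guided noise prediction.

    Each argument stands for one forward pass of the denoiser at a given
    diffusion step: unconditional, audio-conditioned, and
    audio+reference-image-conditioned. The guidance scales s_audio and
    s_ref are hypothetical, chosen only for illustration.
    """
    return (eps_uncond
            + s_audio * (eps_audio - eps_uncond)
            + s_ref * (eps_audio_ref - eps_audio))

# Toy scalar example standing in for the three per-step forward passes:
guided = multi_cfg(0.0, 1.0, 1.5)  # 0 + 2.0*(1.0-0) + 1.5*(1.5-1.0) = 2.75
```

Because all three passes are needed at every denoising step, guidance triples the per-step cost; FADA's distilled student folds the conditional relationships into learnable tokens so one forward pass suffices.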
Problem

Research questions and friction points this paper is trying to address.

Slow inference speed in diffusion-based avatar synthesis
Reduced robustness with open-set input images
Decreased audio-video correlation in distilled models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-supervised loss enhances model robustness
Multi-CFG distillation reduces inference runs
Learnable tokens optimize audio-image correlation
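The reported 4.17–12.5× NFE speedup comes from cutting both factors of the inference cost: the number of denoising steps and the CFG passes per step. A back-of-the-envelope sketch of that accounting, with hypothetical step counts chosen only to illustrate how the two factors multiply (the paper's exact step counts are not given on this page):

```python
def nfe(steps, cfg_passes_per_step):
    # Number of function evaluations: one denoiser call per CFG pass per step.
    return steps * cfg_passes_per_step

# Hypothetical teacher: 25 denoising steps x 3 CFG passes (uncond, audio,
# audio+reference). Hypothetical student: 6 steps x 1 fused pass.
teacher_nfe = nfe(steps=25, cfg_passes_per_step=3)  # 75 denoiser calls
student_nfe = nfe(steps=6, cfg_passes_per_step=1)   # 6 denoiser calls
speedup = teacher_nfe / student_nfe                 # 12.5x in this example
```

The single-pass multi-CFG distillation alone accounts for a 3× factor; the rest comes from step reduction during distillation.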
Tianyun Zhong (Zhejiang University)
Chao Liang (ByteDance)
Jianwen Jiang (ByteDance)
Gaojie Lin (ByteDance)
Jiaqi Yang (ByteDance)
Zhou Zhao (Zhejiang University)
Machine Learning · Data Mining · Multimedia Computing