TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

📅 2026-04-15

🤖 AI Summary
Existing audio-driven talking-head video generation methods rely on multi-step denoising, which incurs substantial computational overhead, while direct single-step distillation suffers from training instability and degraded output quality. To address these limitations, this work proposes a two-stage progressive distillation framework. First, a stable four-step student model is obtained via distribution matching distillation. Second, adversarial distillation combined with a progressive timestep sampling strategy gradually compresses the model from four steps to single-step generation. The approach also introduces a self-compare adversarial objective, which supplies an intermediate adversarial reference and improves training stability under extreme step compression. The method achieves real-time, single-step inference, accelerating generation by 120× while preserving high visual fidelity.

📝 Abstract
Existing audio-driven digital human video generation models rely on multi-step denoising, resulting in substantial computational overhead that severely limits their deployment in real-world settings. While one-step distillation approaches can significantly accelerate inference, they often suffer from training instability. To address this challenge, we propose TurboTalk, a two-stage progressive distillation framework that effectively compresses a multi-step audio-driven video diffusion model into a single-step generator. We first adopt Distribution Matching Distillation to obtain a strong and stable 4-step student, and then progressively reduce the denoising steps from 4 to 1 through adversarial distillation. To ensure stable training under extreme step reduction, we introduce a progressive timestep sampling strategy and a self-compare adversarial objective, which provides an intermediate adversarial reference to stabilize progressive distillation. Our method achieves single-step generation of talking avatar videos, boosting inference speed by 120× while maintaining high generation quality.
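The abstract does not spell out how the step count is reduced or how timesteps are sampled, so the following is only a rough sketch of what a 4-to-1 progressive schedule could look like. Everything in it is an assumption made for illustration: the function names `steps_for_iteration` and `sample_timestep`, the equal-length stages that pass through 3 and 2 steps, the evenly spaced timestep grid, and `t_max=1000` are hypothetical and are not taken from the paper.

```python
import random


def steps_for_iteration(iteration, total_iterations, start_steps=4, end_steps=1):
    """Hypothetical stage schedule: how many denoising steps the student
    uses at a given training iteration.

    Assumption: training is split into equal-length stages, one per step
    count (4, 3, 2, 1). The paper may use a different schedule.
    """
    num_stages = start_steps - end_steps + 1
    # Integer stage index, clamped so the final iteration stays in the last stage.
    stage = min(iteration * num_stages // total_iterations, num_stages - 1)
    return start_steps - stage


def sample_timestep(num_steps, rng, t_max=1000):
    """Hypothetical sampler: pick one timestep from an evenly spaced grid
    of `num_steps` points ending at `t_max`.

    With num_steps == 1 the grid collapses to [t_max], i.e. the student
    is only ever trained to denoise from pure noise in a single step.
    """
    grid = [t_max * (k + 1) // num_steps for k in range(num_steps)]
    return rng.choice(grid)


if __name__ == "__main__":
    rng = random.Random(0)
    total = 1000
    for it in (0, 250, 500, 750, 999):
        n = steps_for_iteration(it, total)
        print(it, n, sample_timestep(n, rng))
```

In this sketch the set of timesteps the student sees shrinks as training advances, so the model is moved toward the one-step regime gradually rather than all at once, which is the general motivation the abstract gives for progressive reduction.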
Problem

Research questions and friction points this paper is trying to address.

audio-driven talking avatar
one-step generation
training instability
diffusion model distillation
computational overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive distillation
one-step generation
audio-driven talking avatar
adversarial distillation
distribution matching