TurboTalk: Progressive Distillation for One-Step Audio-Driven Talking Avatar Generation

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing audio-driven talking-head video generation methods rely on multi-step denoising, which incurs substantial computational overhead, while single-step distillation suffers from training instability and compromised output quality. To address these limitations, this work proposes TurboTalk, a two-stage progressive distillation framework. First, a stable four-step student model is obtained via Distribution Matching Distillation; then adversarial distillation, combined with a progressive timestep sampling strategy, gradually compresses the model to single-step generation. The approach also introduces a self-compare adversarial objective, which improves training stability under extreme step compression. The method achieves real-time, single-step inference, accelerating generation by 120×, while preserving high visual fidelity.

📝 Abstract

Existing audio-driven digital human video generation models rely on multi-step denoising, resulting in substantial computational overhead that severely limits their deployment in real-world settings. While one-step distillation approaches can significantly accelerate inference, they often suffer from training instability. To address this challenge, we propose TurboTalk, a two-stage progressive distillation framework that effectively compresses a multi-step audio-driven video diffusion model into a single-step generator. We first adopt Distribution Matching Distillation to obtain a strong and stable 4-step student, and then progressively reduce the denoising steps from 4 to 1 through adversarial distillation. To ensure stable training under extreme step reduction, we introduce a progressive timestep sampling strategy and a self-compare adversarial objective, which supplies an intermediate adversarial reference to stabilize progressive distillation. Our method achieves single-step generation of talking avatar videos, boosting inference speed by 120 times while maintaining high generation quality.
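
The abstract gives no details of how the step count is reduced from 4 to 1, so the following is only a minimal sketch of one way such a schedule could look. It assumes the step count drops by one after a fixed number of training iterations and that timesteps are evenly spaced; the names `ProgressiveSchedule`, `iters_per_phase`, and `timestep_grid` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProgressiveSchedule:
    """Hypothetical schedule: hold each step count for a fixed number of iterations."""

    start_steps: int = 4         # step count of the stage-one student (from the abstract)
    end_steps: int = 1           # target step count (from the abstract)
    iters_per_phase: int = 1000  # assumed value, not taken from the paper

    def steps_at(self, iteration: int) -> int:
        """Return how many denoising steps the student uses at a training iteration."""
        if iteration < 0:
            raise ValueError("iteration must be non-negative")
        phase = iteration // self.iters_per_phase
        # Drop one step per phase, never going below the target step count.
        return max(self.end_steps, self.start_steps - phase)


def timestep_grid(num_steps: int, t_max: float = 1.0) -> List[float]:
    """Evenly spaced timesteps from t_max downward (an assumed discretization)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return [t_max * (num_steps - i) / num_steps for i in range(num_steps)]


if __name__ == "__main__":
    schedule = ProgressiveSchedule()
    for it in (0, 1000, 2000, 3000):
        n = schedule.steps_at(it)
        print(it, n, timestep_grid(n))
```

In this sketch the student is always trained at its current step count, so the final phase trains a generator that maps noise to output in a single step. The actual sampling strategy and the adversarial objective used by TurboTalk may differ substantially.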

Problem

Research questions and friction points this paper is trying to address.

audio-driven talking avatar

one-step generation

training instability

diffusion model distillation

computational overhead

Innovation

Methods, ideas, or system contributions that make the work stand out.

progressive distillation

one-step generation

audio-driven talking avatar