UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation

📅 2026-03-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the scarcity of open-source, reproducible frameworks for high-fidelity, lip-synchronized talking-portrait generation by proposing a unified, end-to-end diffusion model. A multimodal Transformer applies shared self-attention over audio and video tokens in latent space, explicitly modeling their fine-grained temporal alignment. The framework also builds on priors from a pretrained video generation model and supports personalized voice cloning from a short reference audio clip. Compared with existing open-source methods, it reports better lip-sync accuracy, audio naturalness, and overall perceptual quality.

📝 Abstract

While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
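The shared self-attention idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the modality-specific projection matrices, the single attention head, the token counts, and the names `shared_self_attention` and `init_params` are all assumptions made for illustration. The only point it shows is that video and audio latent tokens are concatenated into one sequence, so every token can attend to tokens of both modalities.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(dim, rng):
    """Random Q/K/V projections per modality (hypothetical layout)."""
    return {
        modality: {
            name: rng.standard_normal((dim, dim)) / np.sqrt(dim)
            for name in ("q", "k", "v")
        }
        for modality in ("video", "audio")
    }


def shared_self_attention(video_tokens, audio_tokens, params):
    """Single-head joint attention over concatenated video and audio tokens.

    video_tokens: (n_video, dim) array of video latent tokens
    audio_tokens: (n_audio, dim) array of audio latent tokens
    """
    dim = video_tokens.shape[-1]

    # Assumption: each modality keeps its own projection weights.
    qv, kv, vv = (video_tokens @ params["video"][n] for n in ("q", "k", "v"))
    qa, ka, va = (audio_tokens @ params["audio"][n] for n in ("q", "k", "v"))

    # Concatenate along the sequence axis so attention spans both modalities.
    q = np.concatenate([qv, qa], axis=0)
    k = np.concatenate([kv, ka], axis=0)
    v = np.concatenate([vv, va], axis=0)

    weights = softmax(q @ k.T / np.sqrt(dim), axis=-1)
    out = weights @ v

    # Split the joint output back into per-modality streams.
    n_video = video_tokens.shape[0]
    return out[:n_video], out[n_video:], weights


rng = np.random.default_rng(0)
video = rng.standard_normal((6, 16))  # 6 video tokens (illustrative size)
audio = rng.standard_normal((4, 16))  # 4 audio tokens (illustrative size)
params = init_params(16, rng)
video_out, audio_out, weights = shared_self_attention(video, audio, params)
```

In this sketch the off-diagonal blocks of `weights` (video rows against audio columns, and the reverse) are where cross-modal correspondence such as lip-sync timing would be expressed. A real model would add multiple heads, positional information, and the diffusion conditioning, none of which are shown here.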

Problem

Research questions and friction points this paper is trying to address.

talking portrait generation

audio-video synchronization

closed-source models

lip-sync accuracy

open-source framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Audio-Video Framework

Multi-Modal Transformer

End-to-End Diffusion