Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

📅 2026-04-26
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing joint audio-video generation methods couple the two modalities throughout the denoising process, entangling high-level semantics with low-level details and reducing the efficiency of talking-head synthesis. This work proposes Talker-T2AV, an autoregressive diffusion framework that decouples semantic modeling from rendering: a shared autoregressive backbone captures high-level cross-modal semantics over a unified patch-level token space, while two lightweight, modality-specific diffusion Transformer heads independently refine low-level audio and video details. On talking-portrait benchmarks, the approach outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, and achieves stronger cross-modal consistency than cascaded pipelines.

📝 Abstract
Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework where high-level cross-modal modeling occurs in a shared backbone, while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space. Two lightweight diffusion transformer heads decode the hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, achieving stronger cross-modal consistency than cascaded pipelines.
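The split described in the abstract (one shared autoregressive backbone, two modality-specific heads) can be illustrated with a toy sketch. This is not the authors' implementation: the causal-mean "backbone", the fixed-step "denoiser" heads, the token-fusion rule, and every name and constant below (`DIM`, `STEPS`, `backbone_step`, `make_head`, `generate`) are hypothetical stand-ins chosen only to show the control flow, in which each step produces one shared hidden state and each head independently refines its own latent from noise conditioned on that state.

```python
import random

DIM = 4    # hypothetical width of tokens, hidden states, and latents
STEPS = 8  # hypothetical number of refinement steps per head


def backbone_step(history):
    """Toy stand-in for the shared autoregressive backbone.

    Returns the mean of all past tokens, so the hidden state at each
    step depends only on earlier tokens (causal).
    """
    n = len(history)
    return [sum(tok[i] for tok in history) / n for i in range(DIM)]


def make_head(scale):
    """Build a toy modality-specific head (assumption, not a real DiT).

    The head starts from Gaussian noise and repeatedly moves halfway
    toward a target derived from the backbone hidden state.
    """
    def head(hidden, rng):
        target = [scale * h for h in hidden]
        x = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
        for _ in range(STEPS):
            x = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
        return x
    return head


audio_head = make_head(1.0)   # toy audio renderer
video_head = make_head(-1.0)  # toy video renderer


def generate(text_tokens, num_frames, seed=0):
    """Autoregressively emit paired audio/video latents, one per frame."""
    rng = random.Random(seed)
    history = [list(t) for t in text_tokens]
    audio, video = [], []
    for _ in range(num_frames):
        hidden = backbone_step(history)   # shared cross-modal state
        a = audio_head(hidden, rng)       # low-level audio refinement
        v = video_head(hidden, rng)       # low-level video refinement
        audio.append(a)
        video.append(v)
        # Assumed fusion rule: average both latents into the next token.
        history.append([(ai + vi) / 2 for ai, vi in zip(a, v)])
    return audio, video


if __name__ == "__main__":
    prompt = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    audio, video = generate(prompt, num_frames=3)
    print(len(audio), len(video))
```

In this sketch the two heads never attend to each other; any coupling between the modalities passes through the shared hidden state, which is the separation of high-level modeling from low-level rendering that the abstract argues for.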
Problem

Research questions and friction points this paper is trying to address.

talking head synthesis
audio-video generation
cross-modal coherence
modality entanglement
diffusion modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

autoregressive diffusion
cross-modal disentanglement
talking head synthesis
modality-specific decoding
unified token space