EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control

📅 2026-03-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing audio-driven talking head generation methods suffer from limited expressiveness, coarse control granularity, and high latency due to their reliance on intermediate facial representations or patch-based diffusion strategies. This work proposes an end-to-end, GPT-style autoregressive model that synthesizes variable-length videos through a frame-by-frame, context-aware streaming paradigm, ensuring identity consistency and enabling fine-grained interactive control. Key innovations include the Sink Frame Window Attention mechanism, which maintains identity coherence over long sequences, and a streaming Frame Condition In-Context scheme that allows dynamic injection of diverse control signals at arbitrary time steps. Experiments demonstrate that the proposed method achieves generation quality comparable to diffusion models while significantly reducing latency and outperforming existing autoregressive approaches.
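As a rough illustration of the sink-plus-window attention pattern the summary describes, here is a minimal PyTorch sketch. The paper's exact SFA formulation is not reproduced here, so the mask construction, the `window` parameter, and the choice of the first frame as the sink are all assumptions for illustration:

```python
import torch

def sink_frame_window_mask(num_frames: int, window: int) -> torch.Tensor:
    """Boolean attention mask (True = attend) for a hypothetical
    sink-frame window attention: every frame attends to the first
    ("sink") frame plus a causal window of recent frames, keeping
    memory bounded while anchoring identity to the sink frame."""
    idx = torch.arange(num_frames)
    # Causal sliding window: frame t attends to frames in (t - window, t].
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    mask[:, 0] = True  # every frame also attends to the sink frame
    return mask

# Example: 8 frames, window of 3 recent frames plus the sink frame.
print(sink_frame_window_mask(num_frames=8, window=3).int())
```

With such a mask, per-frame attention cost stays constant as the video grows, which is what makes variable-length streaming generation tractable in this sketch.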

📝 Abstract
Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and incurring inherent latency because the entire window must be denoised before any frame is output. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a frame-by-frame, in-context, audio-driven streaming generation paradigm. To inherently support variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme, which efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.
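A minimal sketch of how streaming, in-context condition injection could work, assuming only what the abstract states: per-frame control tokens, when present, are interleaved into the autoregressive token stream so the model reads them in context. The function name, tensor shapes, and the sparse dictionary of controls are illustrative assumptions, not the paper's API:

```python
import torch

def inject_frame_conditions(frame_tokens, cond_tokens):
    """Interleave sparse per-frame control tokens into an
    autoregressive token stream (hypothetical FCIC-style scheme).

    frame_tokens: list of (T_i, D) tensors, one entry per frame
    cond_tokens:  dict {frame_index: (C, D) tensor} of controls,
                  which may appear at arbitrary time steps
    """
    seq = []
    for t, tokens in enumerate(frame_tokens):
        if t in cond_tokens:           # control injected mid-stream
            seq.append(cond_tokens[t])
        seq.append(tokens)
    return torch.cat(seq, dim=0)       # one in-context stream

# Example: 5 frames of 4 tokens each, one control signal at frame 2
# (e.g., an expression or pose control arriving mid-generation).
frames = [torch.randn(4, 64) for _ in range(5)]
controls = {2: torch.randn(1, 64)}
stream = inject_frame_conditions(frames, controls)
print(stream.shape)  # torch.Size([21, 64])
```

The design point this sketch highlights is that controls ride the same sequence as the frame tokens, so no separate conditioning network is needed and a control can take effect from the exact frame at which it is injected.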
Problem

Research questions and friction points this paper is trying to address.

talking head synthesis
audio-driven generation
autoregressive modeling
diffusion models
fine-grained control
Innovation

Methods, ideas, or system contributions that make the work stand out.

autoregressive talking head
streaming generation
frame-wise control
in-context learning
Sink Frame Window Attention
Yuzhe Weng
University of Science and Technology of China
Haotian Wang
University of Science and Technology of China
Yuanhong Yu
Zhejiang University
Jun Du
Professor, NERC-SLIP, USTC
Speech Signal Processing · Audio Signal Processing · Pattern Recognition
Shan He
iFLYTEK
Xiaoyan Wu
iFLYTEK
Haoran Xu
iFLYTEK