EARTalking: End-to-end GPT-style Autoregressive Talking Head Synthesis with Frame-wise Control

📅 2026-03-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing audio-driven talking head generation methods suffer from limited expressiveness, coarse control granularity, and high latency due to their reliance on intermediate facial representations or patch-based diffusion strategies. This work proposes an end-to-end, GPT-style autoregressive model that synthesizes variable-length videos through a frame-by-frame, context-aware streaming paradigm, ensuring identity consistency and enabling fine-grained interactive control. Key innovations include the Sink Frame Window Attention mechanism, which maintains identity coherence over long sequences, and a streaming Frame Condition In-Context scheme that allows dynamic injection of diverse control signals at arbitrary time steps. Experiments demonstrate that the proposed method achieves generation quality comparable to diffusion models while significantly reducing latency and outperforming existing autoregressive approaches.
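As a rough illustration of the sink-plus-window attention pattern the summary describes, here is a minimal PyTorch sketch. The paper's exact SFA formulation is not reproduced here, so the mask construction, the `window` parameter, and the choice of the first frame as the sink are all assumptions for illustration:

```python
import torch

def sink_frame_window_mask(num_frames: int, window: int) -> torch.Tensor:
    """Boolean attention mask (True = attend) for a hypothetical
    sink-frame window attention: every frame attends to the first
    ("sink") frame plus a causal window of recent frames, keeping
    memory bounded while anchoring identity to the sink frame."""
    idx = torch.arange(num_frames)
    # Causal sliding window: frame t attends to frames in (t - window, t].
    mask = (idx[None, :] <= idx[:, None]) & (idx[:, None] - idx[None, :] < window)
    mask[:, 0] = True  # every frame also attends to the sink frame
    return mask

# Example: 8 frames, window of 3 recent frames plus the sink frame.
print(sink_frame_window_mask(num_frames=8, window=3).int())
```

With such a mask, per-frame attention cost stays constant as the video grows, which is what makes variable-length streaming generation tractable in this sketch.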

📝 Abstract
Audio-driven talking head generation aims to create vivid and realistic videos from a static portrait and speech. Existing AR-based methods rely on intermediate facial representations, which limit their expressiveness and realism. Meanwhile, diffusion-based methods generate clip-by-clip, lacking fine-grained control and incurring inherent latency because the entire window must be denoised before any frame is output. To address these limitations, we propose EARTalking, a novel end-to-end, GPT-style autoregressive model for interactive audio-driven talking head generation. Our method introduces a frame-by-frame, in-context, audio-driven streaming generation paradigm. To inherently support variable-length video generation with identity consistency, we propose the Sink Frame Window Attention (SFA) mechanism. Furthermore, to avoid the complex, separate networks that prior works required for diverse control signals, we propose a streaming Frame Condition In-Context (FCIC) scheme, which efficiently injects diverse control signals in a streaming, in-context manner, enabling interactive control at every frame and at arbitrary moments. Experiments demonstrate that EARTalking outperforms existing autoregressive methods and achieves performance comparable to diffusion-based methods. Our work demonstrates the feasibility of in-context streaming autoregressive control, unlocking a scalable direction for flexible, efficient generation. The code will be released for reproducibility.
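A minimal sketch of how streaming, in-context condition injection could work, assuming only what the abstract states: per-frame control tokens, when present, are interleaved into the autoregressive token stream so the model reads them in context. The function name, tensor shapes, and the sparse dictionary of controls are illustrative assumptions, not the paper's API:

```python
import torch

def inject_frame_conditions(frame_tokens, cond_tokens):
    """Interleave sparse per-frame control tokens into an
    autoregressive token stream (hypothetical FCIC-style scheme).

    frame_tokens: list of (T_i, D) tensors, one entry per frame
    cond_tokens:  dict {frame_index: (C, D) tensor} of controls,
                  which may appear at arbitrary time steps
    """
    seq = []
    for t, tokens in enumerate(frame_tokens):
        if t in cond_tokens:           # control injected mid-stream
            seq.append(cond_tokens[t])
        seq.append(tokens)
    return torch.cat(seq, dim=0)       # one in-context stream

# Example: 5 frames of 4 tokens each, one control signal at frame 2
# (e.g., an expression or pose control arriving mid-generation).
frames = [torch.randn(4, 64) for _ in range(5)]
controls = {2: torch.randn(1, 64)}
stream = inject_frame_conditions(frames, controls)
print(stream.shape)  # torch.Size([21, 64])
```

The design point this sketch highlights is that controls ride the same sequence as the frame tokens, so no separate conditioning network is needed and a control can take effect from the exact frame at which it is injected.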
Problem

Research questions and friction points this paper is trying to address.

talking head synthesis
audio-driven generation
autoregressive modeling
diffusion models
fine-grained control
Innovation

Methods, ideas, or system contributions that make the work stand out.

autoregressive talking head
streaming generation
frame-wise control
in-context learning
Sink Frame Window Attention
Yuzhe Weng
University of Science and Technology of China
Haotian Wang
University of Science and Technology of China
Yuanhong Yu
Zhejiang University
Jun Du
Professor, NERC-SLIP, USTC
Speech Signal Processing · Audio Signal Processing · Pattern Recognition
Shan He
iFLYTEK
Xiaoyan Wu
iFLYTEK
Haoran Xu
iFLYTEK