EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing speech-driven talking-avatar methods: they rely on dual audio streams, which introduces a look-ahead dependency on the interlocutor's audio and is therefore incompatible with causal user-LLM interaction. The authors propose a single-audio-stream framework that supports real-time, natural turn-taking through explicit per-frame listen-speak state control and a streaming audio scheduler. The core design couples a Rectified-Flow Diffusion Transformer (DiT), described as the first for this task, with a differentiable renderer, enabling high-fidelity avatar generation in as few as four sampling steps. A two-stage training strategy, coefficient-space pretraining followed by joint image-domain fine-tuning, narrows the gap between motion-level supervision and rendered quality. The authors report state-of-the-art visual quality and motion fidelity in both speaking and listening scenarios, with spurious mouth motion suppressed during listening and seamless turn-taking.
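To make the single-stream idea concrete, here is a minimal sketch of how one causal audio stream might be cut into per-frame chunks, each tagged with a listen/speak flag. It is not the authors' Streaming Audio Scheduler: the names `State`, `Frame`, and `schedule`, the turn format, and the zero-padding rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator, List, Sequence, Tuple


class State(Enum):
    LISTEN = 0  # user is talking; the avatar reacts but should not lip-sync
    SPEAK = 1   # agent is talking; the avatar should lip-sync


@dataclass
class Frame:
    audio: List[float]  # audio samples aligned to one video frame
    state: State        # per-frame conditioning flag (hypothetical format)


def schedule(turns: Sequence[Tuple[State, Sequence[float]]],
             samples_per_frame: int) -> Iterator[Frame]:
    """Yield per-frame (audio, state) pairs from turns in arrival order.

    Only audio that has already arrived is consumed, so the sketch needs
    no look-ahead into the other speaker's future audio.
    """
    for state, samples in turns:
        for start in range(0, len(samples), samples_per_frame):
            chunk = list(samples[start:start + samples_per_frame])
            # Zero-pad the final chunk of a turn so all frames have equal length.
            chunk += [0.0] * (samples_per_frame - len(chunk))
            yield Frame(audio=chunk, state=state)


# Toy stream: a user turn of 5 samples, then an agent turn of 4 samples.
turns = [(State.LISTEN, [0.1] * 5), (State.SPEAK, [0.2] * 4)]
frames = list(schedule(turns, samples_per_frame=2))
states = [f.state.name for f in frames]
```

In this toy setup a downstream motion model would receive one audio chunk and one state flag per frame, which is the kind of interface the summary describes; how the actual system encodes audio or handles overlapping speech is not specified here.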

📝 Abstract

We present EmbodiedHead, a speech-driven talking-head framework that equips LLMs with real-time visual avatars for conversation. A practical embodied avatar must achieve real-time generation, unified listening-speaking behavior, and high rendered visual quality simultaneously. Our framework couples the first Rectified-Flow Diffusion Transformer (DiT) for this task with a differentiable renderer, enabling diverse, high-fidelity generation in as few as four sampling steps. Prior listening-speaking methods rely on dual-stream audio, introducing an interlocutor look-ahead dependency incompatible with causal user-LLM interaction. We instead adopt a single-stream interface with explicit per-frame listening-speaking state conditioning and a Streaming Audio Scheduler, suppressing spurious mouth motion during listening while enabling seamless turn-taking. A two-stage training scheme of coefficient-space pretraining and joint image-domain refinement further closes the gap between motion-level supervision and rendered quality. Extensive experiments demonstrate state-of-the-art visual quality and motion fidelity in both speaking and listening scenarios.
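The four-step figure is easier to appreciate with a small sketch of how rectified-flow sampling generally works: a sample is produced by integrating a learned velocity field from noise at t = 0 to data at t = 1, and near-straight trajectories allow very few Euler steps. The code below is a generic illustration, not the paper's model. `toy_velocity` is a hypothetical stand-in for a trained DiT, `TARGET` is made-up data, and the convention that t = 0 is noise is an assumption.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int = 4) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # t takes values 0, dt, ..., 1 - dt; it never reaches 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical target standing in for motion coefficients of one frame.
TARGET = [0.5, -1.0, 2.0]


def toy_velocity(x: Vector, t: float) -> Vector:
    """Toy field whose trajectories are straight lines ending at TARGET.

    A trained network would replace this function; here the remaining
    displacement is divided by the remaining time (1 - t).
    """
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


noise = [0.0, 0.0, 0.0]
sample = euler_sample(toy_velocity, noise, num_steps=4)
```

Because the toy field is exactly straight, four Euler steps land on `TARGET`; a learned field is only approximately straight, so the step count becomes a trade-off between speed and fidelity.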

Problem

Research questions and friction points this paper is trying to address.

talking avatar

real-time generation

listening-speaking behavior

visual quality

conversational agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

Rectified-Flow Diffusion Transformer

differentiable renderer

single-stream audio interface