🤖 AI Summary
This work addresses a limitation of existing speech-driven talking-avatar methods: they rely on dual audio streams and therefore need look-ahead access to the interlocutor's audio, which is incompatible with causal user-LLM interaction. The authors propose EmbodiedHead, a single-audio-stream framework that supports real-time, natural turn-taking through explicit per-frame listen-speak state conditioning and a Streaming Audio Scheduler. The core design couples a Rectified-Flow Diffusion Transformer, which the authors describe as the first for this task, with a differentiable renderer, enabling high-fidelity avatar generation in as few as four sampling steps. A two-stage training strategy, coefficient-space pretraining followed by joint image-domain refinement, narrows the gap between motion-level supervision and rendered quality. The authors report state-of-the-art visual quality and motion fidelity in both speaking and listening scenarios, with spurious mouth motion suppressed during listening and seamless transitions between turns.
📝 Abstract
We present EmbodiedHead, a speech-driven talking-head framework that equips LLMs with real-time visual avatars for conversation. A practical embodied avatar must achieve real-time generation, unified listening-speaking behavior, and high rendered visual quality simultaneously. Our framework couples the first Rectified-Flow Diffusion Transformer (DiT) for this task with a differentiable renderer, enabling diverse, high-fidelity generation in as few as four sampling steps. Prior listening-speaking methods rely on dual-stream audio, introducing an interlocutor look-ahead dependency incompatible with causal user-LLM interaction. We instead adopt a single-stream interface with explicit per-frame listening-speaking state conditioning and a Streaming Audio Scheduler, suppressing spurious mouth motion during listening while enabling seamless turn-taking. A two-stage training scheme of coefficient-space pretraining and joint image-domain refinement further closes the gap between motion-level supervision and rendered quality. Extensive experiments demonstrate state-of-the-art visual quality and motion fidelity in both speaking and listening scenarios.
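To illustrate why a rectified-flow model can sample in very few steps, here is a minimal sketch of fixed-step Euler integration along a straight noise-to-data path. It is not the EmbodiedHead implementation: the names `euler_sample` and `make_toy_velocity` are hypothetical, the analytic velocity field stands in for the trained DiT, and the three-element `target` vector is made-up data standing in for motion coefficients.

```python
import random


def euler_sample(velocity, x0, num_steps=4):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)
    with fixed-step Euler. `velocity` is any callable (x, t) -> list."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for a trained network: the exact velocity
    field of the straight path x_t = (1 - t) * x0 + t * target.
    Only evaluated for t < 1, so the division is safe."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Made-up "motion coefficients" used purely for illustration.
target = [0.3, -1.2, 0.8]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]

# Four Euler steps, matching the step count quoted in the abstract.
sample = euler_sample(make_toy_velocity(target), noise, num_steps=4)
```

Because the toy path is exactly straight, the velocity is constant along it and Euler integration reaches the target regardless of step count. A learned velocity field is only approximately straight, so the number of steps trades speed against accuracy.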