🤖 AI Summary
This work addresses a limitation of existing methods, which generate isolated "talking heads" and fail to model spatial interaction in realistic two-person dialogues. The authors propose a dual-stream generative architecture that disentangles a mixed audio stream and synchronously synthesizes 3D facial animations for both interlocutors, leveraging speaker-role embeddings and cross-speaker attention mechanisms. For the first time, dynamic 3D spatial relationships, including relative positions, orientations, and mutual gaze, are explicitly modeled. The approach introduces text-controlled relative head poses and a novel gaze loss to achieve natural eye contact. Evaluated on a newly curated in-the-wild dialogue dataset comprising over 2 million dyadic pairs, the method significantly outperforms current baselines in perceived realism and interaction coherence, demonstrating strong potential for immersive applications such as virtual reality and telepresence.
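The summary and abstract do not include implementation details, so the sketch below is only an illustration of how the described dual-stream design with speaker-role embeddings and cross-speaker attention could look in PyTorch. All module names, dimensions, and the exact attention wiring are assumptions, not the authors' implementation: each stream is conditioned on a learned role embedding, attends to the shared mixed-audio features, and cross-attends to the other stream's hidden states.

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """One block of a hypothetical dual-stream decoder: each stream attends to
    the shared mixed-audio features and to the other stream (illustrative only)."""

    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        # Learned role embeddings distinguish speaker A from speaker B.
        self.role_emb = nn.Embedding(2, dim)
        self.audio_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward_stream(self, h, other, audio, role_id):
        # Condition the stream on its speaker role.
        h = h + self.role_emb(role_id)[:, None, :]
        # Attend to the mixed-audio features (implicit source separation).
        h = h + self.audio_attn(self.norm1(h), audio, audio)[0]
        # Inter-speaker cross-attention: query own state, key/value from the other stream.
        h = h + self.cross_attn(self.norm2(h), other, other)[0]
        return h + self.ffn(self.norm3(h))

    def forward(self, h_a, h_b, audio):
        role_a = torch.zeros(h_a.size(0), dtype=torch.long, device=h_a.device)
        role_b = torch.ones(h_b.size(0), dtype=torch.long, device=h_b.device)
        return (self.forward_stream(h_a, h_b, audio, role_a),
                self.forward_stream(h_b, h_a, audio, role_b))

# Example usage with dummy shapes (batch=2, 100 animation frames, 512-dim features):
# block = DualStreamBlock()
# h_a, h_b, audio = torch.randn(2, 100, 512), torch.randn(2, 100, 512), torch.randn(2, 250, 512)
# out_a, out_b = block(h_a, h_b, audio)
```

Sharing one set of weights across both streams while swapping only the role embedding is just one possible weight-tying choice; the paper may instead use separate per-stream parameters.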
📝 Abstract
We tackle the challenging task of generating complete 3D facial animations for two interacting, co-located participants from a mixed audio stream. While existing methods often produce disembodied "talking heads" akin to a video conference call, our work is the first to explicitly model the dynamic 3D spatial relationship, including relative position, orientation, and mutual gaze, that is crucial for realistic in-person dialogues. Our system synthesizes the full performance of both individuals, including precise lip-sync, and uniquely allows their relative head poses to be controlled via textual descriptions. To achieve this, we propose a dual-stream architecture in which each stream is responsible for one participant's output, and we employ speaker-role embeddings and inter-speaker cross-attention mechanisms to disentangle the mixed audio and model the interaction. Furthermore, we introduce a novel eye gaze loss to promote natural, mutual eye contact. To power our data-hungry approach, we build a new pipeline that curates a large-scale conversational dataset of over 2 million dyadic pairs from in-the-wild videos. Our method generates fluid, controllable, and spatially aware dyadic animations suitable for immersive applications in VR and telepresence, significantly outperforming existing baselines in perceived realism and interaction coherence.
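The abstract does not define the eye gaze loss, so the following is only a plausible sketch of how a mutual-gaze objective could be formulated: on frames labeled as eye contact, each speaker's predicted gaze direction is pulled toward the other speaker's head position via a cosine penalty. All tensor names, shapes, and the masking scheme are assumptions rather than the paper's formulation.

```python
import torch.nn.functional as F

def mutual_gaze_loss(gaze_dir_a, gaze_dir_b, head_pos_a, head_pos_b, mutual_mask):
    """Hypothetical mutual-gaze loss (illustrative, not taken from the paper).

    gaze_dir_*:  (B, T, 3) predicted gaze directions for each speaker.
    head_pos_*:  (B, T, 3) head positions in a shared 3D coordinate frame.
    mutual_mask: (B, T) 1 where the frame should show eye contact, else 0.
    """
    mutual_mask = mutual_mask.float()
    # Target directions point from each speaker's head toward the other's.
    target_a = F.normalize(head_pos_b - head_pos_a, dim=-1)
    target_b = F.normalize(head_pos_a - head_pos_b, dim=-1)
    # Cosine penalty (1 - cos angle) between predicted and target gaze.
    cos_a = (F.normalize(gaze_dir_a, dim=-1) * target_a).sum(dim=-1)
    cos_b = (F.normalize(gaze_dir_b, dim=-1) * target_b).sum(dim=-1)
    per_frame = (1.0 - cos_a) + (1.0 - cos_b)
    # Average only over frames annotated (or pseudo-labeled) as mutual gaze.
    return (per_frame * mutual_mask).sum() / mutual_mask.sum().clamp(min=1.0)
```

A masked formulation like this only rewards eye contact where it is expected, so it would not force the two speakers to stare at each other continuously; how the paper obtains the eye-contact supervision from in-the-wild videos is not stated in the abstract.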