🤖 AI Summary
This work addresses end-to-end audiovisual co-generation of photorealistic 4D talking avatars, integrating speech, lip motion, facial expression, and head pose, from text alone. Methodologically, we propose a unified diffusion-based framework built on two parallel diffusion transformers, augmented with cross-modal highway connections that keep the audio and visual modalities synchronized, and trained via flow matching for both generation fidelity and fast inference. Beyond single-speaker synthesis, the model supports dyadic conversations, producing an always-on avatar that actively listens and reacts to a user's audio-visual input in real time. Quantitative and perceptual evaluations demonstrate state-of-the-art naturalness, temporal alignment, and expressiveness, and a user study confirms significantly higher human-likeness than existing methods.
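To make the dual-tower design concrete, below is a minimal PyTorch sketch of two parallel transformer stacks coupled by intermediate highway connections. All names, layer counts, widths, and the additive fusion rule are illustrative assumptions for this sketch, not the paper's published configuration; it only shows how per-layer cross-modal exchange can keep the audio and visual streams aligned.

```python
import torch
import torch.nn as nn

class DualTowerWithHighways(nn.Module):
    """Two parallel transformer towers (audio / visual) coupled by
    per-layer cross-modal highway connections. Names, sizes, and the
    additive fusion rule are assumptions of this sketch, not AV-Flow's
    published configuration."""

    def __init__(self, dim: int = 512, n_layers: int = 8, n_heads: int = 8):
        super().__init__()
        def block():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=n_heads, batch_first=True)
        self.audio_layers = nn.ModuleList([block() for _ in range(n_layers)])
        self.visual_layers = nn.ModuleList([block() for _ in range(n_layers)])
        # Learned projections that carry hidden states across towers.
        self.audio_to_visual = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(n_layers)])
        self.visual_to_audio = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(n_layers)])

    def forward(self, audio_tokens, visual_tokens):
        # Assumes both streams are already resampled to a shared frame
        # rate, so tokens correspond one-to-one in time.
        a, v = audio_tokens, visual_tokens
        for i in range(len(self.audio_layers)):
            a = self.audio_layers[i](a)
            v = self.visual_layers[i](v)
            # Highway connection: each tower additively receives a
            # projection of the other tower's hidden state, so speech
            # intonation and facial dynamics can co-evolve layer by layer.
            a, v = (a + self.visual_to_audio[i](v),
                    v + self.audio_to_visual[i](a))
        return a, v
```

The additive exchange here is just one plausible reading of "highway connections"; gating or cross-attention between towers would be equally consistent with the description above.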
📝 Abstract
We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions, and head pose, all generated from just text characters. The core of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In the case of dyadic conversations, AV-Flow produces an always-on avatar that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars. Project page: https://aggelinacha.github.io/AV-Flow/
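As a rough illustration of the flow-matching objective mentioned above, the snippet below implements a generic conditional flow-matching (rectified-flow) training loss with a linear noise-to-data path. The `model(x_t, t, cond)` velocity-predictor interface and the tensor shapes are assumptions of the sketch, not AV-Flow's actual training code.

```python
import torch

def flow_matching_loss(model, x1, cond):
    # x1: clean target features, shape (batch, frames, dim);
    # cond: conditioning (e.g., text embeddings). Both interfaces are
    # assumptions of this sketch, not AV-Flow's actual code.
    b = x1.shape[0]
    # Sample one time t in [0, 1] per example, broadcastable over x1.
    t = torch.rand(b, device=x1.device).view(b, *([1] * (x1.dim() - 1)))
    x0 = torch.randn_like(x1)              # Gaussian noise endpoint
    xt = (1.0 - t) * x0 + t * x1           # linear interpolation path
    target_v = x1 - x0                     # constant target velocity
    pred_v = model(xt, t.flatten(), cond)  # predicted velocity field
    return ((pred_v - target_v) ** 2).mean()
```

At inference time, samples are drawn by integrating the learned velocity field from noise to data with an ODE solver in relatively few steps, which is the usual reason flow-matching models offer faster inference than many-step diffusion samplers.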