🤖 AI Summary
Existing listener motion generation methods typically rely on low-dimensional motion representations followed by photorealistic rendering, and struggle to ensure both long-term temporal coherence and high visual fidelity. This paper introduces DiTaiListener, an end-to-end framework for high-fidelity listener video generation. Methodologically, the authors design a Causal Temporal Multimodal Adapter (CTM-Adapter) that injects the speaker's speech and facial cues into the generation process in a causal manner, and propose a two-stage paradigm of segmented generation followed by transition refinement, coupling a video Diffusion Transformer (DiT) with a video-to-video refinement module (DiTaiListener-Edit) to substantially improve long-video coherence and fine-grained detail quality. DiTaiListener improves FID by 73.8% on the RealTalk benchmark and Fréchet Distance (FD) by 6.1% on VICO. User studies confirm a clear preference over state-of-the-art methods in feedback quality, motion diversity, and motion smoothness.
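As a rough illustration of this generate-then-refine paradigm, the sketch below shows how short listener segments could be produced one by one and their seams smoothed by an edit stage. All function names, the segment and overlap lengths, and the placeholder frame smoothing are assumptions for illustration, not the released DiTaiListener implementation.

```python
import numpy as np

SEG_LEN = 48   # frames per generated segment (assumed value)
OVERLAP = 8    # frames on each side of a seam handed to the edit stage (assumed value)

def generate_segment(speaker_audio, speaker_face):
    """Stand-in for DiTaiListener-Gen: listener frames for one short segment."""
    # A real implementation would run the DiT with the CTM-Adapter injecting the
    # speaker's audio/facial cues; here we only return placeholder frames.
    n = len(speaker_audio)
    return np.zeros((n, 64, 64, 3), dtype=np.float32)

def refine_transition(boundary_frames):
    """Stand-in for DiTaiListener-Edit: re-synthesize the frames around a seam."""
    # A real implementation would run a video-to-video diffusion model over this
    # window; a temporal neighbor average serves as a length-preserving placeholder.
    prev = np.roll(boundary_frames, 1, axis=0)
    nxt = np.roll(boundary_frames, -1, axis=0)
    smoothed = (prev + boundary_frames + nxt) / 3.0
    smoothed[0], smoothed[-1] = boundary_frames[0], boundary_frames[-1]
    return smoothed

def generate_long_video(speaker_audio, speaker_face):
    """Generate segment by segment, then refine each seam into a continuous video."""
    segments = [
        generate_segment(speaker_audio[s:s + SEG_LEN], speaker_face[s:s + SEG_LEN])
        for s in range(0, len(speaker_audio), SEG_LEN)
    ]
    video = segments[0]
    for seg in segments[1:]:
        k = min(OVERLAP, len(seg))  # guard against a short final segment
        window = refine_transition(np.concatenate([video[-k:], seg[:k]], axis=0))
        video = np.concatenate([video[:-k], window, seg[k:]], axis=0)
    return video

# Usage sketch: per-frame speaker features for a 10-second clip at 25 fps.
audio_feats = np.random.randn(250, 128).astype(np.float32)
face_feats = np.random.randn(250, 64).astype(np.float32)
print(generate_long_video(audio_feats, face_feats).shape)  # (250, 64, 64, 3)
```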
📝 Abstract
Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen, then refines the transitional frames via DiTaiListener-Edit for seamless transitions. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) to the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process the speaker's auditory and visual cues. The CTM-Adapter integrates the speaker's input into the video generation process in a causal manner to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition-refinement video-to-video diffusion model. It fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both the photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD on VICO) spaces. User studies confirm the superior performance of DiTaiListener: it is the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
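To make the causal conditioning concrete, here is a minimal PyTorch sketch of the kind of causal temporal cross-attention the CTM-Adapter description implies: each listener frame token attends only to speaker cue tokens from the same or earlier timesteps. The layer sizes, the pre-fused speaker feature, and the residual injection are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CausalTemporalCrossAttention(nn.Module):
    """Sketch of causal cross-attention from listener frame tokens to speaker cues."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, listener_tokens, speaker_cues):
        # listener_tokens: (B, T, dim), one token per listener video frame
        # speaker_cues:    (B, T, dim), fused speaker audio + facial features per frame
        B, T, _ = listener_tokens.shape
        # True entries are masked out: frame t may not attend to cues from t' > t.
        future_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=listener_tokens.device),
            diagonal=1,
        )
        out, _ = self.attn(
            query=self.norm(listener_tokens),
            key=speaker_cues,
            value=speaker_cues,
            attn_mask=future_mask,
        )
        # Residual injection of the speaker conditioning into the generation stream.
        return listener_tokens + out

# Usage sketch: condition 48 listener frame tokens on time-aligned speaker cues.
layer = CausalTemporalCrossAttention(dim=512)
listener = torch.randn(2, 48, 512)
speaker = torch.randn(2, 48, 512)
print(layer(listener, speaker).shape)  # torch.Size([2, 48, 512])
```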