🤖 AI Summary
Existing listener motion generation methods typically rely on low-dimensional motion representations followed by photorealistic rendering, and struggle to ensure both long-term temporal coherence and high visual fidelity. This paper introduces DiTaiListener, an end-to-end framework for high-fidelity listener video generation. Methodologically, the authors design a Causal Temporal Multimodal Adapter (CTM-Adapter) that injects the speaker's speech and facial cues into the generation process in a causal manner, and propose a two-stage paradigm of segmented generation followed by transition refinement, coupling a video Diffusion Transformer (DiT) with a video-to-video refinement module (DiTaiListener-Edit) to substantially improve long-video coherence and fine-grained detail quality. DiTaiListener improves FID by 73.8% on the RealTalk benchmark and Fréchet Distance (FD) by 6.1% on VICO. User studies confirm a clear preference over state-of-the-art methods in feedback quality, motion diversity, and motion smoothness.
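As a rough illustration of this generate-then-refine paradigm, the sketch below shows how short listener segments could be produced one by one and their seams smoothed by an edit stage. All function names, the segment and overlap lengths, and the placeholder frame smoothing are assumptions for illustration, not the released DiTaiListener implementation.

```python
import numpy as np

SEG_LEN = 48   # frames per generated segment (assumed value)
OVERLAP = 8    # frames on each side of a seam handed to the edit stage (assumed value)

def generate_segment(speaker_audio, speaker_face):
    """Stand-in for DiTaiListener-Gen: listener frames for one short segment."""
    # A real implementation would run the DiT with the CTM-Adapter injecting the
    # speaker's audio/facial cues; here we only return placeholder frames.
    n = len(speaker_audio)
    return np.zeros((n, 64, 64, 3), dtype=np.float32)

def refine_transition(boundary_frames):
    """Stand-in for DiTaiListener-Edit: re-synthesize the frames around a seam."""
    # A real implementation would run a video-to-video diffusion model over this
    # window; a temporal neighbor average serves as a length-preserving placeholder.
    prev = np.roll(boundary_frames, 1, axis=0)
    nxt = np.roll(boundary_frames, -1, axis=0)
    smoothed = (prev + boundary_frames + nxt) / 3.0
    smoothed[0], smoothed[-1] = boundary_frames[0], boundary_frames[-1]
    return smoothed

def generate_long_video(speaker_audio, speaker_face):
    """Generate segment by segment, then refine each seam into a continuous video."""
    segments = [
        generate_segment(speaker_audio[s:s + SEG_LEN], speaker_face[s:s + SEG_LEN])
        for s in range(0, len(speaker_audio), SEG_LEN)
    ]
    video = segments[0]
    for seg in segments[1:]:
        k = min(OVERLAP, len(seg))  # guard against a short final segment
        window = refine_transition(np.concatenate([video[-k:], seg[:k]], axis=0))
        video = np.concatenate([video[:-k], window, seg[k:]], axis=0)
    return video

# Usage sketch: per-frame speaker features for a 10-second clip at 25 fps.
audio_feats = np.random.randn(250, 128).astype(np.float32)
face_feats = np.random.randn(250, 64).astype(np.float32)
print(generate_long_video(audio_feats, face_feats).shape)  # (250, 64, 64, 3)
```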
📝 Abstract
Generating naturalistic and nuanced listener motions for extended interactions remains an open problem. Existing methods often rely on low-dimensional motion codes for facial behavior generation followed by photorealistic rendering, limiting both visual fidelity and expressive richness. To address these challenges, we introduce DiTaiListener, powered by a video diffusion model with multimodal conditions. Our approach first generates short segments of listener responses conditioned on the speaker's speech and facial motions with DiTaiListener-Gen, then refines the transitional frames via DiTaiListener-Edit for seamless transitions. Specifically, DiTaiListener-Gen adapts a Diffusion Transformer (DiT) to the task of listener head portrait generation by introducing a Causal Temporal Multimodal Adapter (CTM-Adapter) to process the speaker's auditory and visual cues. The CTM-Adapter integrates the speaker's input into the video generation process in a causal manner to ensure temporally coherent listener responses. For long-form video generation, we introduce DiTaiListener-Edit, a transition-refinement video-to-video diffusion model. It fuses video segments into smooth and continuous videos, ensuring temporal consistency in facial expressions and image quality when merging short segments produced by DiTaiListener-Gen. Quantitatively, DiTaiListener achieves state-of-the-art performance on benchmark datasets in both the photorealism (+73.8% in FID on RealTalk) and motion representation (+6.1% in FD on VICO) spaces. User studies confirm the superior performance of DiTaiListener: it is the clear preference in terms of feedback, diversity, and smoothness, outperforming competitors by a significant margin.
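To make the causal conditioning concrete, here is a minimal PyTorch sketch of the kind of causal temporal cross-attention the CTM-Adapter description implies: each listener frame token attends only to speaker cue tokens from the same or earlier timesteps. The layer sizes, the pre-fused speaker feature, and the residual injection are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CausalTemporalCrossAttention(nn.Module):
    """Sketch of causal cross-attention from listener frame tokens to speaker cues."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, listener_tokens, speaker_cues):
        # listener_tokens: (B, T, dim), one token per listener video frame
        # speaker_cues:    (B, T, dim), fused speaker audio + facial features per frame
        B, T, _ = listener_tokens.shape
        # True entries are masked out: frame t may not attend to cues from t' > t.
        future_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=listener_tokens.device),
            diagonal=1,
        )
        out, _ = self.attn(
            query=self.norm(listener_tokens),
            key=speaker_cues,
            value=speaker_cues,
            attn_mask=future_mask,
        )
        # Residual injection of the speaker conditioning into the generation stream.
        return listener_tokens + out

# Usage sketch: condition 48 listener frame tokens on time-aligned speaker cues.
layer = CausalTemporalCrossAttention(dim=512)
listener = torch.randn(2, 48, 512)
speaker = torch.randn(2, 48, 512)
print(layer(listener, speaker).shape)  # torch.Size([2, 48, 512])
```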