Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D conversational head generation methods typically model speaking and listening as independent or non-causal processes, failing to capture inter-turn temporal coherence and the bidirectional dynamic coupling between speech and nonverbal cues (e.g., nodding, gaze shifts, micro-expressions). To address this, the authors propose TIMAR, a framework that introduces turn-level causal attention within an interleaved masked autoregressive structure to enable history-aware, causally constrained sequence modeling. A lightweight diffusion-based head module jointly optimizes motion coordination and expressive diversity. Evaluated on the DualTalk benchmark, TIMAR achieves 15–30% reductions in Fréchet distance and MSE over prior methods, with similar gains on out-of-distribution data. The authors state the code will be released publicly.

📝 Abstract
Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15–30% on the test set, and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository https://github.com/CoderChen01/towards-seamleass-interaction.
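The abstract's key constraint is turn-level causality: tokens may attend freely within their own dialogue turn but only backward across turns. The paper's exact masking scheme is not given here, so the following is a minimal hypothetical sketch of such a block-causal mask; `turn_ids` and `turn_causal_mask` are illustrative names, not from the paper.

```python
import numpy as np

def turn_causal_mask(turn_ids):
    """Build a turn-level causal attention mask.

    turn_ids[i] is the dialogue turn that token i belongs to. A token may
    attend to every token in its own turn (bidirectional within a turn)
    and to all tokens from earlier turns, but never to a future turn.
    Returns a boolean matrix where mask[q, k] == True means q may attend to k.
    """
    t = np.asarray(turn_ids)
    # q attends to k iff k's turn index is not later than q's turn index
    return t[None, :] <= t[:, None]

# Three tokens in turn 0, two tokens in turn 1
mask = turn_causal_mask([0, 0, 0, 1, 1])
```

Unlike a standard per-token causal mask, this lets within-turn tokens see each other in both directions, which matches the abstract's claim of fusing multimodal information inside each turn while accumulating history across turns.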
Problem

Research questions and friction points this paper is trying to address.

Model bidirectional 3D head dynamics in conversation
Enable causal, turn-level modeling for temporal coherence
Generate expressive, coordinated avatars from audio-visual contexts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal turn-level modeling for 3D head generation
Interleaved audio-visual context fusion within turns
Lightweight diffusion head predicts continuous expressive dynamics
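The "interleaved audio-visual context fusion" bullet implies the dialogue is serialized as alternating per-turn audio and visual features before turn-level attention is applied. The paper's serialization is not specified here; this is a hypothetical sketch, with `interleave_turns` and the modality tags as assumed names.

```python
def interleave_turns(audio_turns, visual_turns):
    """Interleave per-turn audio and visual features into one chronological
    context sequence [a_1, v_1, a_2, v_2, ...]. Each element is tagged with
    its modality and turn index so a downstream model can apply
    turn-level causal attention over the combined sequence."""
    seq = []
    for i, (a, v) in enumerate(zip(audio_turns, visual_turns)):
        seq.append(("audio", i, a))
        seq.append(("visual", i, v))
    return seq

seq = interleave_turns(["a0", "a1"], ["v0", "v1"])
```

The turn index carried by each element is what a turn-level causal mask would key on, rather than raw token position.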