🤖 AI Summary
This work addresses the limitation of existing gesture generation methods, which predominantly rely on single-speaker audio and overlook the social context and dyadic interaction dynamics inherent in conversational settings. To this end, we propose DyaDiT, a multimodal diffusion Transformer that pioneers the modeling of mutual generation mechanisms in two-person dialogues. By integrating binaural audio, social context tokens, and a motion dictionary, DyaDiT enables responsive co-speech gesture synthesis that dynamically reacts to the interlocutor's posture. Experimental results demonstrate that DyaDiT significantly outperforms current state-of-the-art approaches across both objective metrics and user studies, producing gestures perceived as more natural and strongly preferred by human evaluators.
📝 Abstract
Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaged in conversation. We present DyaDiT, a multimodal diffusion Transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens as input, fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally condition on the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion-generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially appropriate motion generation. Code and models will be released upon acceptance.