InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing audio-driven video generation methods in modeling cross-individual dependencies in dyadic interactions and enabling fine-grained control over reactive behaviors. To this end, we propose a query-based intermediate visual motion guidance framework that aligns identity-agnostic motion priors with audio semantic intent to generate contextually coherent and natural interactive dynamics. Our key innovations include an Interactivity Injector, a MetaQuery-based modality alignment mechanism, and Role-aware Dyadic Gaussian Guidance (RoDG). Furthermore, we establish a dedicated evaluation benchmark designed specifically for dyadic interaction synthesis. Experimental results demonstrate that our approach significantly outperforms state-of-the-art methods in naturalness, contextual relevance, and lip-sync accuracy, while maintaining spatial consistency and audiovisual synchronization even under extreme head poses.
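
The MetaQuery-based alignment can be pictured as a small set of learnable query tokens that cross-attend to conversational audio features (e.g., hidden states from an MLLM audio encoder) and emit tokens in the motion-prior space used to condition the generator. The PyTorch sketch below is a minimal illustration of that idea only; the module name, dimensions, and layer counts are assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class MetaQueryAligner(nn.Module):
    """Minimal sketch of a MetaQuery-style modality aligner (illustrative only).

    A fixed set of learnable query tokens cross-attends to audio features
    and returns tokens in the motion-prior space that would later condition
    the video generator. All dimensions here are assumptions.
    """

    def __init__(self, num_queries=32, audio_dim=1024, motion_dim=512, num_layers=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, motion_dim) * 0.02)
        self.audio_proj = nn.Linear(audio_dim, motion_dim)
        layer = nn.TransformerDecoderLayer(d_model=motion_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, audio_feats):
        # audio_feats: (B, T_audio, audio_dim) conversational audio features
        memory = self.audio_proj(audio_feats)                      # (B, T_audio, motion_dim)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        # Queries attend to the audio context and return motion-guidance tokens
        return self.decoder(q, memory)                             # (B, num_queries, motion_dim)
```

In the pipeline described by the paper, such tokens would bridge audio semantics and the identity-agnostic motion priors consumed by the Interactivity Injector; here they are simply a tensor of shape (B, num_queries, motion_dim).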

📝 Abstract
Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis via querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework is able to distill linguistic intent from audio to dictate the precise timing and appropriateness of reactions. To further improve lip-sync quality under extreme head poses, we propose Role-aware Dyadic Gaussian Guidance (RoDG) for enhanced lip-synchronization and spatial consistency. Finally, we introduce a dedicated evaluation suite with newly designed metrics to quantify dyadic interaction. Comprehensive experiments demonstrate that InterDyad significantly outperforms state-of-the-art methods in producing natural and contextually grounded two-person interactions. Please refer to our project page for demo videos: https://interdyad.github.io/.
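
Role-aware Dyadic Gaussian Guidance is described as a spatial prior that keeps the two roles and their lip regions spatially grounded under extreme head poses. A toy rendering of per-person Gaussian guidance maps might look like the sketch below; the choice of keypoint (mouth centre), sigma, and role-to-channel layout are illustrative assumptions, not details taken from the paper.

```python
import torch

def dyadic_gaussian_guidance(centers, roles, height, width, sigma=8.0):
    """Toy rendering of role-aware dyadic Gaussian guidance maps (assumed layout).

    For each of the two people, a 2D Gaussian is placed at a tracked keypoint
    (e.g., the mouth centre) and written into a role-specific channel
    (0 = speaker, 1 = listener).

    centers: tensor of shape (2, 2) with (x, y) pixel coordinates per person
    roles:   length-2 list of ints, 0 for speaker and 1 for listener
    returns: (2, height, width) guidance tensor
    """
    ys = torch.arange(height).view(-1, 1).float()
    xs = torch.arange(width).view(1, -1).float()
    guidance = torch.zeros(2, height, width)
    for (cx, cy), role in zip(centers.tolist(), roles):
        # Isotropic Gaussian centred on the person's keypoint
        g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        guidance[role] = torch.maximum(guidance[role], g)
    return guidance
```

Such maps could be concatenated with the generator's spatial inputs so that lip motion stays anchored to the correct person even when the head turns away from the camera; how InterDyad actually injects RoDG is not specified here.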
Problem

Research questions and friction points this paper is trying to address.

speech-to-video synthesis
dyadic interaction
cross-individual dependencies
reactive behaviors
lip synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive Dyadic Generation
MetaQuery-based Alignment
Multimodal Large Language Model (MLLM)
Role-aware Gaussian Guidance
Motion Prior Querying