Modeling Turn-Taking with Semantically Informed Gestures

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Dialogue turn-taking prediction remains challenging because it requires fine-grained, temporally sensitive multimodal modeling, particularly in multi-party settings where nonverbal cues such as gestures carry critical turn-related intent. Method: This work proposes a semantics-aware, fine-grained turn-taking prediction framework that integrates semantic gestures. The authors extend the DnD Gesture corpus with 2,663 new fine-grained semantic gesture annotations, creating the first multi-participant dialogue dataset supporting semantic-modality alignment. They introduce a novel semantic-guided gesture representation that explicitly encodes turn-taking intention, and employ a Mixture-of-Experts architecture to jointly fuse textual, acoustic, and semantic gesture features. Contribution/Results: Experiments show significant improvements over unimodal and conventional multimodal baselines on turn-taking prediction, validating the complementary role of semantic gestures in modeling time-critical conversational behavior and establishing a new paradigm for multimodal conversational intelligence.

📝 Abstract
In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
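The abstract describes a Mixture-of-Experts framework that fuses text, audio, and gesture features. The paper's actual architecture is not detailed here; as a rough illustration only, a gated fusion over per-modality expert outputs could look like the following minimal sketch, where each expert is reduced to a precomputed feature vector and the gating network to fixed scalar logits (all function names and parameters are hypothetical):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_fuse(text_feat, audio_feat, gesture_feat, gate_logits):
    """Hypothetical MoE-style fusion: weight each modality expert's
    output by a softmax gate and sum them into one fused vector.
    In a real model the gate logits would be produced by a learned
    gating network conditioned on the input, not passed in directly."""
    experts = [text_feat, audio_feat, gesture_feat]
    gates = softmax(gate_logits)  # one scalar weight per expert
    dim = len(text_feat)
    fused = [sum(g * e[i] for g, e in zip(gates, experts))
             for i in range(dim)]
    return fused, gates

# Example: equal logits give each modality a 1/3 gate weight.
fused, gates = moe_fuse([1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                        [0.0, 0.0, 0.0])
```

This is a sketch under stated assumptions, not the paper's implementation; the point is only that gesture features enter the prediction on equal footing with text and audio, with the gate deciding how much each modality contributes per input.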
Problem

Research questions and friction points this paper is trying to address.

Modeling multimodal turn-taking using semantic gestures
Extending corpus with annotated gesture types for analysis
Integrating gestures with text and audio for prediction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends gesture corpus with semantic annotations
Models turn-taking using Mixture-of-Experts framework
Integrates text, audio, and gestures for prediction