Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues

📅 2025-03-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the neglect of nonverbal cues in spoken discourse modeling by proposing a gesture-augmented language modeling approach. Methodologically, a VQ-VAE discretizes 3D human motion sequences into learnable gesture tokens, whose embedding space is aligned with that of a text encoder via feature alignment; text infilling (cloze) tasks are then constructed to evaluate the joint linguistic and gestural representations. The key contribution is a computationally tractable, cross-modally alignable gesture signal that introduces structured nonverbal priors into spoken language understanding. Experiments demonstrate that the gesture-augmented model improves prediction accuracy on three key discourse cue categories (discourse connectives, stance markers, and quantifiers), validating the complementary value of nonverbal modalities for language modeling.
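The first step described above, discretizing continuous motion into gesture tokens, boils down to a nearest-codebook lookup at inference time. Below is a minimal illustrative sketch of that quantization step with a toy random codebook; the paper's actual VQ-VAE learns its codebook and encoder jointly, so all names and shapes here are assumptions.

```python
import numpy as np

def quantize_motion(motion, codebook):
    """Map each motion frame to its nearest codebook entry (gesture token).

    motion:   (T, D) array of pose features per frame
    codebook: (K, D) VQ codebook (random here, learned in the actual model)
    returns:  (T,) integer gesture token ids
    """
    # Squared Euclidean distance from every frame to every code vector
    dists = ((motion[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # K=8 codes, D=4 features (toy sizes)
# Frames built from codes 2, 2, 5, 7 plus slight noise
motion = codebook[[2, 2, 5, 7]] + 0.01 * rng.normal(size=(4, 4))
tokens = quantize_motion(motion, codebook)
print(tokens)  # → [2 2 5 7]: each frame snaps back to its source code
```

The resulting integer token sequence is what makes gestures "computationally tractable": it can be embedded and fed to a language model exactly like a sequence of subword ids.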

📝 Abstract
Research in linguistics shows that non-verbal cues, such as gestures, play a crucial role in spoken discourse. For example, speakers perform hand gestures to indicate topic shifts, helping listeners identify transitions in discourse. In this work, we investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling in language models. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. These gesture token embeddings are then aligned with text embeddings through feature alignment, mapping them into the text embedding space. To evaluate the gesture-aligned language model on spoken discourse, we construct text infilling tasks targeting three key discourse cues grounded in linguistic research: discourse connectives, stance markers, and quantifiers. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks, highlighting the complementary information that gestures can offer in modeling spoken discourse. We view this work as an initial step toward leveraging non-verbal cues to advance spoken language modeling in language models.
Problem

Research questions and friction points this paper is trying to address.

Can jointly modeling gestures (as 3D human motion sequences) alongside language improve spoken discourse modeling in language models?
How can continuous 3D motion be discretized into gesture tokens whose embeddings align with the text embedding space?
Do gestures help predict key discourse cues such as connectives, stance markers, and quantifiers?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates gestures via VQ-VAE encoding of 3D motion sequences
Aligns gesture token embeddings with the text embedding space
Enhances discourse-cue prediction with gestural information
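The feature-alignment step listed above maps gesture token embeddings into the text embedding space. A minimal sketch of one plausible formulation, a learned linear projection trained to maximize cosine similarity with paired text embeddings, is shown below; the projection shape, loss, and dimensions are assumptions, not the paper's exact recipe.

```python
import numpy as np

def cosine_alignment_loss(gesture_emb, text_emb, W):
    """Mean (1 - cosine similarity) between projected gesture embeddings
    and their paired text embeddings.

    W is the learned projection from the D_g gesture space into the
    D_t text embedding space.
    """
    proj = gesture_emb @ W                                   # (N, D_t)
    proj = proj / np.linalg.norm(proj, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float(np.mean(1.0 - (proj * txt).sum(axis=1)))

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 32))   # gesture dim 16 -> text dim 32 (toy sizes)
g = rng.normal(size=(5, 16))    # 5 gesture token embeddings
t = g @ W                       # perfectly aligned pairs, for illustration
print(cosine_alignment_loss(g, t, W))  # near 0 for aligned pairs
```

Minimizing such a loss over paired (gesture, text) data is what lets gesture tokens be consumed by the language model as if they were ordinary text-space inputs.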