Large Language Models for Virtual Human Gesture Selection

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing core challenges in virtual human interaction, including the difficulty of co-speech gesture selection, semantic mismatch, and temporal asynchrony, this paper introduces an LLM-driven co-speech gesture selection framework powered by GPT-4. The method employs structured gesture knowledge encoding and multi-strategy, context-aware prompt engineering to achieve precise semantic mapping from speech to gesture motion, and integrates into Unity/Unreal animation pipelines. It is the first approach to systematically leverage GPT-4's deep semantic modeling for gesture recommendation while preserving real-time deployability. Quantitative and human evaluations show significant improvements: +42% in gesture semantic relevance (per expert annotation) and sub-0.3-second average alignment error between speech and gesture timing. Experimental results confirm enhanced user engagement and agent credibility.

📝 Abstract
Co-speech gestures convey a wide variety of meanings and play an important role in face-to-face human interactions. These gestures significantly influence the addressee's engagement, recall, comprehension, and attitudes toward the speaker. Similarly, they impact interactions between humans and embodied virtual agents. The process of selecting and animating meaningful gestures has thus become a key focus in the design of these agents. However, automating this gesture selection process poses a significant challenge. Prior gesture generation techniques have varied from fully automated, data-driven methods, which often struggle to produce contextually meaningful gestures, to more manual approaches that require expertise in crafting specific gestures, are time-consuming, and lack generalizability. In this paper, we leverage the semantic capabilities of Large Language Models to develop a gesture selection approach that suggests meaningful, appropriate co-speech gestures. We first describe how information on gestures is encoded into GPT-4. Then, we conduct a study to evaluate alternative prompting approaches for their ability to select meaningful, contextually relevant gestures and to align them appropriately with the co-speech utterance. Finally, we detail and demonstrate how this approach has been implemented within a virtual agent system, automating the selection and subsequent animation of the selected gestures for enhanced human-agent interactions.
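To make the gesture-encoding and prompting steps concrete, the sketch below shows one way such a selector could be wired up. The gesture lexicon, the JSON output contract, and the `select_gestures` helper are illustrative assumptions; the paper's actual gesture encoding and prompt strategies are not reproduced here.

```python
# A minimal sketch of prompt-based gesture selection with GPT-4, assuming a
# hypothetical gesture lexicon and a JSON output contract; it illustrates the
# general idea from the abstract, not the paper's actual prompts.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical lexicon: gesture name -> meaning it conveys.
GESTURE_LEXICON = {
    "container": "presenting an idea as a bounded object held between the hands",
    "dismiss": "waving away or rejecting something",
    "enumerate": "counting points off on the fingers",
    "deictic_you": "pointing toward the addressee",
}

def select_gestures(utterance: str) -> list[dict]:
    """Ask GPT-4 to map spans of the utterance to gestures from the lexicon."""
    system = (
        "You select co-speech gestures for a virtual agent. Available gestures:\n"
        + "\n".join(f"- {name}: {desc}" for name, desc in GESTURE_LEXICON.items())
        + "\nRespond with only a JSON list of objects, each with keys 'span' "
        "(the exact words the gesture accompanies) and 'gesture' (a lexicon name)."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": utterance},
        ],
    )
    # Assumes the model honors the JSON-only instruction; production code
    # would validate and retry on malformed output.
    return json.loads(resp.choices[0].message.content)

print(select_gestures("First, you gather the data; then you throw out the noise."))
```

One design note: constraining the model to a fixed lexicon keeps every suggestion animatable, since each returned name maps to an existing animation clip rather than to free-form motion.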
Problem

Research questions and friction points this paper is trying to address.

Automating meaningful co-speech gesture selection for virtual agents.
Overcoming limitations of prior gesture generation techniques.
Enhancing human-agent interactions through contextually relevant gestures.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages GPT-4 for gesture selection.
Automates meaningful co-speech gesture animation.
Evaluates prompting for contextual gesture alignment; a timing-alignment sketch follows this list.
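As a companion to the selection sketch above, here is a minimal, hypothetical sketch of the alignment step: scheduling each selected gesture against word-level timestamps such as a TTS engine might report. The stroke-on-first-word rule and the data shapes are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of aligning selected gestures to speech timing; the
# scheduling rule (gesture stroke on the onset of the span's first word) is an
# illustrative assumption, not the paper's method.
from dataclasses import dataclass

@dataclass
class ScheduledGesture:
    gesture: str
    start: float  # seconds from the start of the utterance

def schedule(selections: list[dict], word_onsets: dict[str, float]) -> list[ScheduledGesture]:
    """selections: [{'span': words, 'gesture': name}]; word_onsets: word -> onset (s)."""
    scheduled = []
    for sel in selections:
        first_word = sel["span"].split()[0].lower().strip(".,;")
        if first_word in word_onsets:
            scheduled.append(ScheduledGesture(sel["gesture"], word_onsets[first_word]))
    return scheduled

# Example onsets a TTS engine might report for the earlier utterance.
onsets = {"first": 0.0, "you": 0.45, "gather": 0.62, "throw": 1.80}
print(schedule([{"span": "throw out the noise", "gesture": "dismiss"}], onsets))
```

In a full pipeline these start times would drive the Unity/Unreal animation controller; printing here simply shows the data handed off.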
👥 Authors
P. Torshizi (Northeastern University)
L. B. Hensel (University of Glasgow)
Ari Shapiro (FlawlessAI; computer animation, computer graphics, physics)
Stacy Marsella (Northeastern University)