Dynamic Multimodal Expression Generation for LLM-Driven Pedagogical Agents: From User Experience Perspective

📅 2026-03-10
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses the limitations of existing virtual reality (VR) instructional agents, which often rely on static speech and simplistic gestures and therefore fail to dynamically coordinate multimodal expressions with pedagogical semantics, ultimately constraining interaction naturalness and learning efficacy. To overcome this, we propose a large language model–driven, semantics-aware multimodal generation approach that leverages semantically sensitive prompt engineering to synchronously produce speech and gestures coherently aligned with instructional content. Our method introduces, for the first time, a dynamic and semantically consistent multimodal expression mechanism into the design of teaching agents. User studies demonstrate that the resulting VR instructional agent prototype significantly enhances learners' sense of social presence and perceived anthropomorphism, improves perceived learning outcomes, engagement, and willingness to use, and reduces feelings of fatigue and monotony.
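To make the prompt-engineering step concrete, here is a minimal sketch of what a semantics-sensitive prompt for coordinated speech-and-gesture generation could look like. The paper does not reproduce its actual prompts, so the function name `build_expression_prompt`, the gesture vocabulary, and the JSON output schema below are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: a prompt asking an LLM to return speech utterances
# and semantically matched gesture tags as one aligned structure.
# Vocabulary and schema are assumptions for demonstration only.

GESTURE_VOCAB = ["point", "open_palm", "nod", "beat", "enumerate", "neutral"]

PROMPT_TEMPLATE = """You are a virtual teacher in a VR classroom.
Rewrite the lesson content below as short spoken utterances, and tag each
utterance with one gesture from this vocabulary: {vocab}.
Choose gestures that match the semantics of each utterance (e.g. "point"
when referring to an object in the scene, "enumerate" when listing items).

Lesson content:
{content}

Respond with a JSON array of objects: {{"speech": str, "gesture": str}}.
"""

def build_expression_prompt(content: str) -> str:
    """Fill the template with the instructional content to be taught."""
    return PROMPT_TEMPLATE.format(
        vocab=", ".join(GESTURE_VOCAB), content=content
    )
```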

📝 Abstract
In virtual reality (VR) educational scenarios, pedagogical agents (PAs) enhance immersive learning through realistic appearances and interactive behaviors. However, most existing PAs rely on static speech and simple gestures, which limits their ability to adapt dynamically to the semantic context of instructional content; as a result, interactions often lack naturalness and effectiveness during teaching. To address this challenge, this study proposes a large language model (LLM)-driven multimodal expression generation method that constructs semantically sensitive prompts to generate coordinated speech and gesture instructions, enabling dynamic alignment between instructional semantics and multimodal expressive behaviors. A VR-based PA prototype was developed and evaluated through subjective, user experience-oriented experiments. Results indicate that dynamically generated multimodal expressions significantly enhance learners' perceived learning effectiveness, engagement, and intention to use, while effectively alleviating fatigue and boredom during learning. Furthermore, the combined dynamic expression of speech and gestures notably strengthens learners' perceptions of human-likeness and social presence. The findings provide new insights and design guidelines for building more immersive and naturally expressive intelligent PAs.
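On the agent side, the generated instructions must be parsed and dispatched so that speech synthesis and avatar animation stay in lockstep. The sketch below shows one plausible shape for that step, assuming the JSON schema from the prompt sketch above; `ExpressionCue`, `parse_expression_plan`, and the engine calls in the trailing comment are hypothetical, not the paper's API.

```python
import json
from dataclasses import dataclass

# Mirrors the illustrative vocabulary used in the prompt sketch above.
GESTURE_VOCAB = {"point", "open_palm", "nod", "beat", "enumerate", "neutral"}

@dataclass
class ExpressionCue:
    speech: str   # text to be synthesized by the TTS engine
    gesture: str  # gesture tag driving the avatar's animation controller

def parse_expression_plan(llm_output: str) -> list[ExpressionCue]:
    """Parse the LLM's JSON response into a sequence of aligned cues.

    Unknown gesture tags fall back to "neutral", so a malformed
    generation degrades gracefully instead of breaking delivery.
    """
    cues = []
    for item in json.loads(llm_output):
        gesture = item.get("gesture", "neutral")
        if gesture not in GESTURE_VOCAB:
            gesture = "neutral"
        cues.append(ExpressionCue(speech=item["speech"], gesture=gesture))
    return cues

# In the VR runtime, each cue would drive TTS and animation together, e.g.:
#   for cue in parse_expression_plan(response):
#       avatar.play_gesture(cue.gesture)   # hypothetical engine API
#       tts.speak(cue.speech)              # hypothetical TTS API
```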
Problem

Research questions and friction points this paper is trying to address.

Pedagogical Agents
Multimodal Expression
Dynamic Adaptation
Virtual Reality
User Experience
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven
multimodal expression generation
semantic alignment
pedagogical agents
virtual reality
Ninghao Wan
School of Telecommunications Engineering, Xidian University, Xi'an, 710071, China
Jiarun Song
Xidian University
Fuzheng Yang
School of Telecommunications Engineering, Xidian University, China; School of Electrical and Computer Engineering, Royal Melbourne Institute of Technology, Melbourne, VIC 3001, Australia