🤖 AI Summary
This work addresses the limitations of existing virtual reality (VR) instructional agents, which often rely on static speech and simplistic gestures and therefore fail to dynamically coordinate multimodal expressions with pedagogical semantics, ultimately constraining interaction naturalness and learning efficacy. To overcome this, we propose a large language model (LLM)-driven, semantics-aware multimodal generation approach that leverages semantically sensitive prompt engineering to synchronously produce speech and gestures coherently aligned with instructional content. Our method introduces, for the first time, a dynamic and semantically consistent multimodal expression mechanism into the design of teaching agents. User studies demonstrate that the resulting VR instructional agent prototype significantly enhances learners' sense of social presence and perceived anthropomorphism, improves perceived learning outcomes, engagement, and willingness to use, and reduces feelings of fatigue and monotony.
📝 Abstract
In virtual reality (VR) educational scenarios, pedagogical agents (PAs) enhance immersive learning through realistic appearances and interactive behaviors. However, most existing PAs rely on static speech and simple gestures. This limitation reduces their ability to adapt dynamically to the semantic context of instructional content, so interactions often lack naturalness and effectiveness in the teaching process. To address this challenge, this study proposes a large language model (LLM)-driven multimodal expression generation method that constructs semantically sensitive prompts to generate coordinated speech and gesture instructions, enabling dynamic alignment between instructional semantics and multimodal expressive behaviors. A VR-based PA prototype was developed and evaluated through subjective, user-experience-oriented experiments. Results indicate that dynamically generated multimodal expressions significantly enhance learners' perceived learning effectiveness, engagement, and intention to use, while effectively alleviating feelings of fatigue and boredom during the learning process. Furthermore, the combined dynamic expression of speech and gestures notably strengthens learners' perceptions of human-likeness and social presence. The findings provide new insights and design guidelines for building more immersive and naturally expressive intelligent PAs.