🤖 AI Summary
This study addresses the limited naturalness and responsiveness of avatar-mediated dialogue in VR. We propose a real-time multi-avatar dialogue system driven by a locally deployed large language model (LLM), integrating automatic speech recognition (ASR), text-to-speech (TTS), and lip-sync rendering. Our method introduces an LLM-guided finite-state machine for avatar behavior control and a retrieval-augmented generation (RAG)-enhanced context-aware response mechanism. Crucially, we conduct the first systematic VR evaluation of how avatar state indicators (thinking, listening, and generating states) affect perceived response latency and user immersion. A pilot user study demonstrates that these three states significantly improve realism and response predictability in task-oriented dialogues. The work contributes a reusable system architecture for task-oriented conversational AI in VR and empirically grounded design guidelines for avatar state signaling.
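The paper does not include code, so the following is only a minimal Python sketch of the avatar status indicator state machine described above, assuming event-driven transitions between idle, listening, thinking, and generating states; all class, event, and callback names are hypothetical, and in the actual system the transitions could be guided by the LLM rather than hard-coded.

```python
from enum import Enum, auto


class AvatarState(Enum):
    """Visible status of one avatar (names hypothetical)."""
    IDLE = auto()
    LISTENING = auto()   # user is speaking; ASR is transcribing
    THINKING = auto()    # transcript sent to the local LLM; awaiting first token
    GENERATING = auto()  # tokens streaming; TTS and lip-sync play back


class AvatarStateMachine:
    """Event-driven skeleton for the indicator logic.

    The paper describes the state machine as LLM-guided; here the
    transitions are hard-coded pipeline events purely for illustration.
    """

    TRANSITIONS = {
        (AvatarState.IDLE, "speech_detected"): AvatarState.LISTENING,
        (AvatarState.LISTENING, "transcript_ready"): AvatarState.THINKING,
        (AvatarState.THINKING, "first_token"): AvatarState.GENERATING,
        (AvatarState.GENERATING, "response_done"): AvatarState.IDLE,
    }

    def __init__(self, on_change=None):
        self.state = AvatarState.IDLE
        self.on_change = on_change  # e.g. swaps the indicator shown in the VR scene

    def dispatch(self, event: str) -> AvatarState:
        nxt = self.TRANSITIONS.get((self.state, event))
        if nxt is not None:
            self.state = nxt
            if self.on_change:
                self.on_change(nxt)
        return self.state


if __name__ == "__main__":
    sm = AvatarStateMachine(on_change=lambda s: print("indicator ->", s.name))
    for event in ("speech_detected", "transcript_ready",
                  "first_token", "response_done"):
        sm.dispatch(event)
```

Keying the indicator to pipeline milestones (speech detected, transcript ready, first token) is one way to let the avatar signal progress before the full reply is synthesized, which is the response-generation window in which the study's status indicators are shown.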
📝 Abstract
We present a virtual reality (VR) environment featuring conversational avatars powered by a locally deployed LLM, integrated with automatic speech recognition (ASR), text-to-speech (TTS), and lip-syncing. Through a pilot study, we explored the effects of three types of avatar status indicators during response generation. Our findings reveal design considerations for improving responsiveness and realism in LLM-driven conversational systems. We also detail two system architectures: one using an LLM-based state machine to control avatar behavior, and another integrating retrieval-augmented generation (RAG) for context-grounded responses. Together, these contributions offer practical insights to guide future work on task-oriented conversational AI in VR environments.
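For the second architecture, the sketch below shows one plausible shape of the RAG context-grounding step, assuming a toy keyword-overlap retriever and a placeholder prompt template; the paper does not specify its retrieval stack, and a real deployment would likely use embedding similarity over a vector index.

```python
# Minimal RAG sketch for the context-grounded response path.
# Everything here is illustrative: the retriever, scoring, and
# prompt format are assumptions, not the paper's actual stack.
from dataclasses import dataclass


@dataclass
class Document:
    text: str


def retrieve(query: str, docs: list[Document], k: int = 3) -> list[Document]:
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    return sorted(
        docs,
        key=lambda d: len(terms & set(d.text.lower().split())),
        reverse=True,
    )[:k]


def build_prompt(query: str, docs: list[Document]) -> str:
    """Assemble a context-grounded prompt for the local LLM."""
    context = "\n".join(f"- {d.text}" for d in docs)
    return (
        "Answer the user using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"User: {query}\nAvatar:"
    )


if __name__ == "__main__":
    corpus = [
        Document("The repair task requires a torque wrench."),
        Document("Avatars greet users on entry."),
        Document("Torque the bolts to 40 Nm in a star pattern."),
    ]
    top = retrieve("Which torque for the bolts?", corpus)
    print(build_prompt("Which torque for the bolts?", top))
```

Grounding the prompt in retrieved task documents is what keeps the avatar's replies tied to the task at hand rather than the LLM's general knowledge, which is the role the abstract assigns to the RAG variant.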