🤖 AI Summary
In VR meetings, users struggle to efficiently re-engage and sustain social presence after interruptions. To address this, we propose EngageSync, a context-aware, avatar-fixed speech-to-text interface that introduces a novel transcription mechanism dynamically adapted to users' real-time engagement states. It is the first VR meeting system to integrate a lightweight fine-tuned LLM (Phi-3) for generating live session summaries, enabling seamless fusion of transcription and summarization. Built on Unity XR, EngageSync processes WebRTC audio streams, leverages Whisper for ASR, and employs eye-tracking-driven engagement detection. Within-subject studies with small (3-avatar) and mid-sized (7-avatar) groups demonstrated that EngageSync reduced re-engagement time by 37% and improved information recall by 29% over avatar-fixed captions (p < .01 in mid-sized groups), and significantly enhanced both social presence and gaze duration toward others compared to table-fixed panels (p < .05).
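The summary names the moving parts but not how they fit together. Below is a minimal Python sketch of the adaptive catch-up logic, assuming the open-source `whisper` package for ASR and Hugging Face `transformers` for Phi-3 (both real APIs); the engagement signal, audio chunking, and summarization prompt are our illustrative assumptions, not the authors' code, and the actual system runs in Unity XR over WebRTC audio.

```python
# Hedged sketch of EngageSync-style adaptive catch-up (not the paper's code).
import whisper
from transformers import pipeline

asr = whisper.load_model("base")  # Whisper ASR
summarizer = pipeline(
    "text-generation", model="microsoft/Phi-3-mini-4k-instruct"
)  # lightweight LLM for live session summaries

transcript_log: list[str] = []  # rolling meeting transcript

def on_audio_chunk(wav_path: str) -> str:
    """Transcribe one buffered audio chunk and append it to the log."""
    text = asr.transcribe(wav_path)["text"]
    transcript_log.append(text)
    return text

def catch_up_view(engaged: bool, disengaged_since: int) -> str:
    """Pick what the avatar-fixed panel shows.

    `engaged` and `disengaged_since` (an index into transcript_log) stand in
    for the eye-tracking engagement detector's outputs, which we assume here.
    """
    if engaged:
        return transcript_log[-1]  # engaged: live caption only
    # disengaged: summarize everything missed since attention dropped
    missed = " ".join(transcript_log[disengaged_since:])
    prompt = f"Summarize this meeting excerpt in two sentences:\n{missed}"
    out = summarizer(prompt, max_new_tokens=80, do_sample=False,
                     return_full_text=False)
    return out[0]["generated_text"]  # LLM-generated catch-up summary
```

The key design point this illustrates is the fusion the summary describes: the same transcript log feeds either a live caption or an LLM summary, with the engagement state selecting between them.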
📝 Abstract
Maintaining engagement in immersive meetings is challenging, particularly when users must catch up on missed content after disruptions. While transcription interfaces can help, table-fixed panels can distract users from the group and diminish social presence, whereas avatar-fixed captions fail to provide past context. We present EngageSync, a context-aware, avatar-fixed transcription interface that adapts to user engagement, offering live transcriptions and LLM-generated summaries to help users catch up while preserving social presence. We implemented a live VR meeting setup for a 12-participant formative study and elicited design considerations. In two user studies with small (3-avatar) and mid-sized (7-avatar) groups, EngageSync significantly improved social presence (p < .05) and increased time spent gazing at others in the group rather than at the interface, compared to table-fixed panels. It also reduced re-engagement time and increased information recall (p < .05) over avatar-fixed interfaces, with stronger effects in mid-sized groups (p < .01).
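For intuition on the engagement adaptation the abstract refers to, here is a hedged sketch of one plausible gaze-dwell heuristic; the sampling window, dwell-ratio threshold, and gaze-target labels are assumptions for illustration, not the paper's published detector.

```python
# Hypothetical gaze-dwell engagement heuristic (our assumption): a user counts
# as engaged if, over a sliding window, enough eye-tracking samples land on
# other avatars rather than on the interface or the environment.
from collections import deque

WINDOW = 90          # assumed: ~3 s of gaze samples at 30 Hz
ENGAGED_RATIO = 0.4  # assumed dwell-ratio threshold

class EngagementDetector:
    def __init__(self) -> None:
        self.samples: deque[bool] = deque(maxlen=WINDOW)

    def add_gaze_sample(self, target: str) -> None:
        """`target` is the ray-cast hit label from the headset's eye tracker,
        e.g. "avatar", "interface", or "environment" (labels are assumed)."""
        self.samples.append(target == "avatar")

    def is_engaged(self) -> bool:
        if len(self.samples) < WINDOW:
            return True  # assume engaged until the window fills
        return sum(self.samples) / WINDOW >= ENGAGED_RATIO
```

A detector like this would also yield the gaze-time measure the studies report, since the same dwell samples distinguish looking at others from looking at the interface.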