🤖 AI Summary
VR speech transcription suffers from the absence of visual context and nonverbal cues—such as gaze, gestures, and pointing—hindering robust anaphora resolution (e.g., interpreting “it” or “there”). To address this, we propose the first multimodal transcription framework that jointly integrates eye-tracking trajectories, laser-pointer trajectories, and scene metadata. Our method structurally encodes nonverbal behaviors and injects them into a GPT-based language model to enable semantics-augmented anaphora resolution. Evaluated on authentic design critique tasks involving 12 users, our approach achieves a 26.5% absolute improvement in coreference accuracy over a speech-only baseline. This work constitutes the first systematic integration of gaze, pointing, and visual scene information for VR dialogue understanding, significantly advancing multimodal anaphora resolution. It establishes a scalable technical pathway for semantic understanding in immersive human–computer interaction.
📝 Abstract
Understanding transcripts of immersive multimodal conversations is challenging because speakers frequently rely on visual context and non-verbal cues, such as gestures and visual attention, which are not captured in speech alone. This lack of information makes coreferences resolution-the task of linking ambiguous expressions like ``it'' or ``there'' to their intended referents-particularly challenging. In this paper we present a system that augments VR speech transcript with eye-tracking laser pointing data, and scene metadata to generate textual descriptions of non-verbal communication and the corresponding objects of interest. To evaluate the system, we collected gaze, gesture, and voice data from 12 participants (6 pairs) engaged in an open-ended design critique of a 3D model of an apartment. Our results show a 26.5% improvement in coreference resolution accuracy by a GPT model when using our multimodal transcript compared to a speech-only baseline.