Augmenting speech transcripts of VR recordings with gaze, pointing, and visual context for multimodal coreference resolution

📅 2025-09-10

📈 Citations: 0

✨ Influential: 0

career value

198K/year

🤖 AI Summary

VR speech transcription suffers from the absence of visual context and nonverbal cues—such as gaze, gestures, and pointing—hindering robust anaphora resolution (e.g., interpreting “it” or “there”). To address this, we propose the first multimodal transcription framework that jointly integrates eye-tracking trajectories, laser-pointer trajectories, and scene metadata. Our method structurally encodes nonverbal behaviors and injects them into a GPT-based language model to enable semantics-augmented anaphora resolution. Evaluated on authentic design critique tasks involving 12 users, our approach achieves a 26.5% absolute improvement in coreference accuracy over a speech-only baseline. This work constitutes the first systematic integration of gaze, pointing, and visual scene information for VR dialogue understanding, significantly advancing multimodal anaphora resolution. It establishes a scalable technical pathway for semantic understanding in immersive human–computer interaction.

Technology Category

Application Category

📝 Abstract

Understanding transcripts of immersive multimodal conversations is challenging because speakers frequently rely on visual context and non-verbal cues, such as gestures and visual attention, which are not captured in speech alone. This lack of information makes coreferences resolution-the task of linking ambiguous expressions like ``it'' or ``there'' to their intended referents-particularly challenging. In this paper we present a system that augments VR speech transcript with eye-tracking laser pointing data, and scene metadata to generate textual descriptions of non-verbal communication and the corresponding objects of interest. To evaluate the system, we collected gaze, gesture, and voice data from 12 participants (6 pairs) engaged in an open-ended design critique of a 3D model of an apartment. Our results show a 26.5% improvement in coreference resolution accuracy by a GPT model when using our multimodal transcript compared to a speech-only baseline.

Problem

Research questions and friction points this paper is trying to address.

Resolving ambiguous references in VR speech transcripts

Incorporating gaze and pointing for multimodal coreference resolution

Enhancing transcript understanding with visual context metadata

Innovation

Methods, ideas, or system contributions that make the work stand out.

Augmenting VR transcripts with gaze and pointing data

Generating textual descriptions of non-verbal communication cues

Integrating eye-tracking and scene metadata for coreference resolution

🔎 Similar Papers

No similar papers found.