🤖 AI Summary
This work proposes a multimodal interaction method that combines gaze-based selection with open-domain spoken dialogue to address two challenges of display-free smart glasses: the absence of visual feedback and imprecise referencing of physical objects. The system uses the user's gaze to localize a target, employs a vision-language model to generate a semantic description and a digital mask, and lets the user correct recognition errors in real time through natural-language conversation. The approach is the first in this domain to integrate gaze guidance with open-vocabulary, voice-based error correction end to end, substantially improving the accuracy of referring to physical objects. In the user study, gaze selection succeeded in 53% of task trials, and voice-based correction resolved 58% of the remaining errors; participants rated the system as likable, useful, and easy to use.
📝 Abstract
Smart glasses enhance interactions with the environment by using head-mounted cameras to observe the user's viewpoint, but they lack the visual feedback used for common interactions. We introduce Gazeify then Voiceify, a multimodal approach that allows object selection via gaze and voice on displayless smart glasses. Users can select a physical object with their gaze, and the system generates a digital mask and a spoken description of the object's semantics. Users can further correct errors through free-form conversation. To demonstrate our approach, we develop an interactive system that integrates advanced object segmentation and detection with a vision-language model. User studies reveal that participants achieve correct gaze selection in 53% of the task trials and use voice disambiguation to correct 58% of the remaining errors. Participants also rated the system as likable, useful, and easy to use.
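
The abstract does not spell out the implementation, but the interaction it describes is a simple loop: segment at the gaze point, have a vision-language model describe the mask aloud, then re-ground the target from any spoken correction. The Python sketch below illustrates that loop under stated assumptions; all function names (`segment_at_point`, `describe_mask`, `reground_from_utterance`, `speak`) are hypothetical placeholders, not the authors' API, and the stubs stand in for a point-promptable segmenter, a vision-language model, and text-to-speech.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Selection:
    mask: object      # binary segmentation mask of the selected object
    description: str  # spoken semantic description, e.g. "the red mug on the desk"

# --- Hypothetical stand-ins for the system's components ---------------------
# A point-promptable segmenter and a vision-language model would replace
# these stubs in a real system.

def segment_at_point(frame, point):
    """Return a candidate object mask around the gaze point (stub)."""
    return f"mask@{point}"

def describe_mask(frame, mask) -> str:
    """Have a vision-language model describe the masked object (stub)."""
    return "a white mug on the desk"

def reground_from_utterance(frame, utterance, prior_mask):
    """Re-ground the target from a free-form correction,
    e.g. 'no, the cup behind it' (stub)."""
    return f"mask<{utterance}>"

def speak(text: str) -> None:
    """Audio-only feedback; no display is required."""
    print(f"[TTS] {text}")

# --- Gaze-first selection with voice-based disambiguation -------------------

def select_object(frame, gaze_point, listen: Callable[[], str]) -> Selection:
    """`listen()` returns the user's spoken reply; an empty string means accept."""
    mask = segment_at_point(frame, gaze_point)
    while True:
        description = describe_mask(frame, mask)
        speak(description)
        reply = listen().strip()
        if not reply:  # user accepts the described object
            return Selection(mask, description)
        # Otherwise, treat the utterance as a correction and re-select.
        mask = reground_from_utterance(frame, reply, mask)

# Example: the user corrects the first guess once, then accepts.
replies = iter(["the cup behind it", ""])
selection = select_object(frame=None, gaze_point=(320, 240),
                          listen=lambda: next(replies))
print(selection.description)
```

The key design point this sketch captures is that all feedback is audible rather than visual, which is what makes the technique viable on displayless hardware: the spoken description plays the confirmation role a highlight box would play on a screen.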