🤖 AI Summary
This work proposes a multimodal interaction method that combines gaze-based selection with open-domain spoken dialogue to address two challenges of display-free smart glasses: the absence of visual feedback and imprecise referencing of physical objects. The system uses the user's gaze to localize a target, employs a vision-language model to generate a semantic description and a digital mask, and lets the user correct recognition errors in real time through natural-language conversation. The approach is the first in this domain to integrate gaze guidance with open-vocabulary, voice-based error correction end to end, substantially improving the accuracy of referring to physical objects. In the user study, gaze selection succeeded in 53% of task trials, and voice-based correction resolved 58% of the remaining errors; participants rated the system as likable, useful, and easy to use.
📝 Abstract
Smart glasses enhance interactions with the environment by using head-mounted cameras to observe the user's viewpoint, but they lack the visual feedback used for common interactions. We introduce Gazeify then Voiceify, a multimodal approach that allows object selection via gaze and voice on displayless smart glasses. Users can select a physical object with their gaze, and the system generates a digital mask and a spoken description of the object's semantics. Users can further correct errors through free-form conversation. To demonstrate our approach, we develop an interactive system that integrates advanced object segmentation and detection with a vision-language model. User studies reveal that participants achieve correct gaze selection in 53% of the task trials and use voice disambiguation to correct 58% of the remaining errors. Participants also rated the system as likable, useful, and easy to use.
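
The abstract does not spell out the implementation, but the interaction it describes is a simple loop: segment at the gaze point, have a vision-language model describe the mask aloud, then re-ground the target from any spoken correction. The Python sketch below illustrates that loop under stated assumptions; all function names (`segment_at_point`, `describe_mask`, `reground_from_utterance`, `speak`) are hypothetical placeholders, not the authors' API, and the stubs stand in for a point-promptable segmenter, a vision-language model, and text-to-speech.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Selection:
    mask: object      # binary segmentation mask of the selected object
    description: str  # spoken semantic description, e.g. "the red mug on the desk"

# --- Hypothetical stand-ins for the system's components ---------------------
# A point-promptable segmenter and a vision-language model would replace
# these stubs in a real system.

def segment_at_point(frame, point):
    """Return a candidate object mask around the gaze point (stub)."""
    return f"mask@{point}"

def describe_mask(frame, mask) -> str:
    """Have a vision-language model describe the masked object (stub)."""
    return "a white mug on the desk"

def reground_from_utterance(frame, utterance, prior_mask):
    """Re-ground the target from a free-form correction,
    e.g. 'no, the cup behind it' (stub)."""
    return f"mask<{utterance}>"

def speak(text: str) -> None:
    """Audio-only feedback; no display is required."""
    print(f"[TTS] {text}")

# --- Gaze-first selection with voice-based disambiguation -------------------

def select_object(frame, gaze_point, listen: Callable[[], str]) -> Selection:
    """`listen()` returns the user's spoken reply; an empty string means accept."""
    mask = segment_at_point(frame, gaze_point)
    while True:
        description = describe_mask(frame, mask)
        speak(description)
        reply = listen().strip()
        if not reply:  # user accepts the described object
            return Selection(mask, description)
        # Otherwise, treat the utterance as a correction and re-select.
        mask = reground_from_utterance(frame, reply, mask)

# Example: the user corrects the first guess once, then accepts.
replies = iter(["the cup behind it", ""])
selection = select_object(frame=None, gaze_point=(320, 240),
                          listen=lambda: next(replies))
print(selection.description)
```

The key design point this sketch captures is that all feedback is audible rather than visual, which is what makes the technique viable on displayless hardware: the spoken description plays the confirmation role a highlight box would play on a screen.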