🤖 AI Summary
This study addresses a limitation of conventional attention modeling in multimodal AI agents: its reliance on implicit assumptions and its lack of explicit physiological grounding when inferring a user's true task intent. We propose using eye-tracking data as a primary attentional cue, integrating calibrated spatiotemporal scanpath sequences as structured contextual input to the reasoning pipeline of multimodal large language models. To our knowledge, this is the first systematic empirical validation, in realistic physical environments, that eye-movement trajectories faithfully encode a user's task state and attentional focus. Experimental results show that the approach significantly improves an AI agent's real-time intent perception, with task-relevant response accuracy increasing by 37%. The method establishes an interpretable and generalizable paradigm for embodied intention inference grounded in objective physiological signals rather than heuristic or latent attention mechanisms.
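The summary does not specify how scanpath sequences are packaged for the model, so the sketch below is only an illustration of the general idea, not the authors' pipeline. The `Fixation` fields, the text format, and the `scanpath_to_context` helper are all assumptions about how calibrated fixation events might be serialized into structured context before a multimodal-model query.

```python
from dataclasses import dataclass

@dataclass
class Fixation:
    """One fixation event from a calibrated eye tracker (hypothetical schema)."""
    t_start: float   # seconds since recording start
    duration: float  # fixation duration in seconds
    x: float         # normalized gaze x in the scene-camera frame [0, 1]
    y: float         # normalized gaze y in the scene-camera frame [0, 1]
    label: str       # object the gaze landed on, if known

def scanpath_to_context(fixations: list[Fixation], max_events: int = 20) -> str:
    """Serialize the most recent fixations into a plain-text block that
    can be prepended to a multimodal agent query as additional context."""
    recent = fixations[-max_events:]
    lines = ["Recent user gaze history (oldest first):"]
    for f in recent:
        lines.append(
            f"- t={f.t_start:.1f}s, {f.duration * 1000:.0f} ms on "
            f"'{f.label}' at ({f.x:.2f}, {f.y:.2f})"
        )
    return "\n".join(lines)

# Example: pair the scanpath context with a user question.
history = [
    Fixation(12.4, 0.35, 0.62, 0.41, "screwdriver"),
    Fixation(13.1, 0.80, 0.58, 0.47, "cabinet hinge"),
    Fixation(14.6, 0.22, 0.30, 0.70, "screw box"),
]
prompt = scanpath_to_context(history) + "\n\nUser: What should I do next?"
print(prompt)
```

A plain-text serialization like this keeps the gaze signal interpretable and model-agnostic; a real system would likely align fixation labels with detections in the scene-camera frames.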
📝 Abstract
Advanced multimodal AI agents can now collaborate with users to solve challenges in the world. We explore eye tracking's role in such interactions as a way to convey a user's attention relative to the physical environment. We hypothesize that this knowledge improves contextual understanding for AI agents. By observing hours of human-object interactions, we first measure the relationship between an eye tracker's signal quality and its ability to reliably place gaze on nearby physical objects. We then conduct experiments that relay the user's scanpath history as additional context when querying multimodal agents. Our results show that eye tracking provides high value as a user-attention signal and can convey information about the user's current task and interests to the agent.
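The paper measures the signal-quality/gaze-placement relationship empirically; as a rough geometric framing of the same question (an assumption, not the paper's analysis), a tracker can only reliably place gaze on an object whose angular size exceeds the tracker's typical angular error:

```python
import math

def object_angular_radius_deg(object_radius_m: float, distance_m: float) -> float:
    """Angular radius of a roughly circular object as seen from the eye."""
    return math.degrees(math.atan2(object_radius_m, distance_m))

def gaze_resolves_object(gaze_error_deg: float,
                         object_radius_m: float,
                         distance_m: float) -> bool:
    """Crude test: a tracker whose typical angular error exceeds the
    object's angular radius cannot reliably place gaze on that object."""
    return gaze_error_deg <= object_angular_radius_deg(object_radius_m, distance_m)

# A 4 cm-wide object (2 cm radius) at arm's length (0.6 m) subtends ~1.9 deg,
# so a ~1 deg tracker can resolve it while a ~3 deg tracker cannot.
for err in (1.0, 3.0):
    print(f"{err:.0f} deg error -> {gaze_resolves_object(err, 0.02, 0.6)}")
```

This back-of-the-envelope view also explains why "nearby" matters: the same object subtends a larger angle at close range, so gaze placement degrades with distance for a fixed tracker error.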