Leveraging Gaze and Set-of-Mark in VLLMs for Human-Object Interaction Anticipation from Egocentric Videos

📅 2026-04-04
🤖 AI Summary
This work addresses human-object interaction anticipation in egocentric videos with a Vision Large Language Model (VLLM) approach that combines Set-of-Mark prompting with the trajectory of the user's most recent gaze fixations. Set-of-Mark prompting improves visual grounding, the gaze trajectory provides a cue to user intent, and an inverse exponential frame sampling strategy captures the temporal dynamics immediately preceding the interaction. The approach is model-agnostic and surpasses state-of-the-art methods on the HD-EPIC dataset.
📝 Abstract
The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system, which must guide users during daily life activities and understand their short- and long-term goals. Creating systems with such capabilities requires addressing several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations in existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and by understanding user intent via the trajectory formed by the user's most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches for the considered task and show that it is model-agnostic.
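The abstract does not give the formula behind the inverse exponential sampling strategy, so the following is only a minimal sketch of one plausible reading: frame offsets, counted backwards from the last observed frame, grow exponentially, so the sampled frames are densest just before the interaction. The function name `inverse_exponential_indices` and the `base` parameter are assumptions made for illustration and are not taken from the paper.

```python
def inverse_exponential_indices(num_frames: int, num_samples: int, base: float = 2.0) -> list[int]:
    """Pick frame indices that are dense near the end of the clip.

    Hypothetical illustration: offsets from the last frame follow
    (base**i - 1) / (base**(num_samples - 1) - 1), scaled to the clip length.
    The paper's actual sampling rule may differ.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if base <= 1.0:
        raise ValueError("base must be greater than 1")

    last = num_frames - 1
    if num_samples == 1:
        return [last]

    denom = base ** (num_samples - 1) - 1.0
    # Offset 0 is the last frame; the largest offset reaches the first frame.
    offsets = [round(last * (base ** i - 1.0) / denom) for i in range(num_samples)]
    # Short clips can produce repeated indices, so duplicates are dropped.
    return sorted({last - off for off in offsets})


if __name__ == "__main__":
    print(inverse_exponential_indices(64, 4))
```

With 64 frames and 4 samples, the gaps between consecutive sampled frames shrink toward the end of the clip, which is the property the abstract describes. A uniform sampler would instead spend the same budget evenly across the whole observation window.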
Problem

Research questions and friction points this paper is trying to address.

human-object interaction anticipation
egocentric vision
vision large language models
visual grounding
gaze trajectory
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Large Language Models
Set-of-Mark prompting
gaze trajectory
inverse exponential sampling
egocentric vision
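To make the combination of Set-of-Mark prompting and gaze trajectory concrete, here is a small sketch of how recent fixations could be mapped to numbered marks and summarized in a text prompt. It assumes each candidate object already has a numeric mark and a bounding box; the `Mark` class, the helper functions, and the prompt wording are all hypothetical and are not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mark:
    """A numbered visual mark with its box (x_min, y_min, x_max, y_max). Hypothetical type."""
    mark_id: int
    box: tuple[float, float, float, float]


def mark_at(point: tuple[float, float], marks: list[Mark]) -> Optional[int]:
    """Return the id of the smallest marked box containing the point, or None."""
    x, y = point
    hits = [m for m in marks if m.box[0] <= x <= m.box[2] and m.box[1] <= y <= m.box[3]]
    if not hits:
        return None
    smallest = min(hits, key=lambda m: (m.box[2] - m.box[0]) * (m.box[3] - m.box[1]))
    return smallest.mark_id


def gaze_to_mark_sequence(fixations: list[tuple[float, float]], marks: list[Mark]) -> list[int]:
    """Convert fixations (oldest first) to visited mark ids, merging consecutive repeats."""
    sequence: list[int] = []
    for point in fixations:
        mark_id = mark_at(point, marks)
        if mark_id is not None and (not sequence or sequence[-1] != mark_id):
            sequence.append(mark_id)
    return sequence


def build_prompt(sequence: list[int]) -> str:
    """Assumed prompt wording, for illustration only."""
    if sequence:
        visited = " -> ".join(str(i) for i in sequence)
        gaze = f"The user's recent gaze visited marks, oldest first: {visited}."
    else:
        gaze = "The recent gaze fixations did not land on any marked object."
    return gaze + " Which marked object will the user interact with next?"


if __name__ == "__main__":
    marks = [Mark(1, (0, 0, 100, 100)), Mark(2, (40, 40, 60, 60)), Mark(3, (200, 200, 300, 300))]
    fixations = [(10, 10), (50, 50), (55, 45), (150, 150), (250, 250)]
    print(build_prompt(gaze_to_mark_sequence(fixations, marks)))
```

Referring to objects by mark number lets the text prompt and the marked image share the same identifiers, which is the grounding benefit usually attributed to Set-of-Mark prompting. Picking the smallest containing box is one simple way to resolve nested objects; other tie-breaking rules are equally possible.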
Daniele Materia
Department of Mathematics and Computer Science – University of Catania, Italy
Francesco Ragusa
Department of Mathematics and Computer Science – University of Catania, Italy; Next Vision s.r.l. – Spinoff of the University of Catania, Italy
Giovanni Maria Farinella
University of Catania
Computer Vision, Machine Learning