🤖 AI Summary
This work addresses human-object interaction anticipation in egocentric videos with a vision-language large model (VLLM) approach that combines eye-gaze fixation trajectories with a Set-of-Mark prompting mechanism. The method improves intention understanding by modeling the user's most recent gaze behavior and employs an inverse exponential frame sampling strategy to capture the critical temporal dynamics immediately preceding an interaction. The approach is model-agnostic, strengthens visual grounding, and achieves state-of-the-art interaction anticipation performance on the HD-EPIC dataset.
📄 Abstract
The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system, both to guide users during daily life activities and to understand their short- and long-term goals. Creating systems with such capabilities requires tackling several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations of existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and by understanding user intent via the trajectory formed by the user's most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches for the considered task and confirm its model-agnostic nature.
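To make the sampling idea concrete, the sketch below shows one plausible reading of an inverse exponential frame-sampling schedule: frame indices are drawn more densely toward the end of the observed clip (the moment just before the interaction) and more sparsely further back in time. The function name, the `rate` parameter, and the logarithmic warp are illustrative assumptions, not the paper's exact formulation.

```python
import math

def inverse_exponential_sample(num_frames: int, num_samples: int,
                               rate: float = 3.0) -> list[int]:
    """Illustrative sketch (not the authors' exact formula): pick
    `num_samples` frame indices from a clip of `num_frames` frames so
    that sampling density increases toward the last frame, i.e. the
    gap between consecutive sampled frames shrinks near the end."""
    indices = []
    for i in range(num_samples):
        # u sweeps [0, 1] uniformly over the requested samples.
        u = i / (num_samples - 1) if num_samples > 1 else 1.0
        # Logarithmic warp of u: values cluster near 1, so sampled
        # frames cluster just before the (anticipated) interaction.
        w = math.log1p(u * (math.exp(rate) - 1.0)) / rate
        indices.append(round(w * (num_frames - 1)))
    # De-duplicate while keeping temporal order (short clips can
    # produce repeated indices after rounding).
    return sorted(set(indices))

# Dense near the end of a 100-frame clip, sparse at its start.
print(inverse_exponential_sample(100, 8))
```

A larger `rate` concentrates the samples even more tightly around the final frames; `rate → 0` recovers approximately uniform sampling.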