🤖 AI Summary
This work addresses human-object interaction anticipation in egocentric videos with a vision-language large model (VLLM) approach that combines eye-gaze fixation trajectories with a Set-of-Mark prompting mechanism. The method improves intention understanding by modeling the user's most recent gaze behavior and employs an inverse exponential frame sampling strategy to capture the critical temporal dynamics immediately preceding an interaction. The approach is model-agnostic, strengthens visual grounding, and achieves state-of-the-art interaction anticipation performance on the HD-EPIC dataset.
📄 Abstract
The ability to anticipate human-object interactions is highly desirable in an intelligent assistive system, both to guide users during daily life activities and to understand their short- and long-term goals. Creating systems with such capabilities requires tackling several complex challenges. This work addresses the problem of human-object interaction anticipation in Egocentric Vision using Vision Large Language Models (VLLMs). We tackle key limitations of existing approaches by improving visual grounding capabilities through Set-of-Mark prompting and by understanding user intent via the trajectory formed by the user's most recent gaze fixations. To effectively capture the temporal dynamics immediately preceding the interaction, we further introduce a novel inverse exponential sampling strategy for input video frames. Experiments conducted on the egocentric dataset HD-EPIC demonstrate that our method surpasses state-of-the-art approaches for the considered task and confirm its model-agnostic nature.
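To make the sampling idea concrete, the sketch below shows one plausible reading of an inverse exponential frame-sampling schedule: frame indices are drawn more densely toward the end of the observed clip (the moment just before the interaction) and more sparsely further back in time. The function name, the `rate` parameter, and the logarithmic warp are illustrative assumptions, not the paper's exact formulation.

```python
import math

def inverse_exponential_sample(num_frames: int, num_samples: int,
                               rate: float = 3.0) -> list[int]:
    """Illustrative sketch (not the authors' exact formula): pick
    `num_samples` frame indices from a clip of `num_frames` frames so
    that sampling density increases toward the last frame, i.e. the
    gap between consecutive sampled frames shrinks near the end."""
    indices = []
    for i in range(num_samples):
        # u sweeps [0, 1] uniformly over the requested samples.
        u = i / (num_samples - 1) if num_samples > 1 else 1.0
        # Logarithmic warp of u: values cluster near 1, so sampled
        # frames cluster just before the (anticipated) interaction.
        w = math.log1p(u * (math.exp(rate) - 1.0)) / rate
        indices.append(round(w * (num_frames - 1)))
    # De-duplicate while keeping temporal order (short clips can
    # produce repeated indices after rounding).
    return sorted(set(indices))

# Dense near the end of a 100-frame clip, sparse at its start.
print(inverse_exponential_sample(100, 8))
```

A larger `rate` concentrates the samples even more tightly around the final frames; `rate → 0` recovers approximately uniform sampling.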