Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision foundation models improve scene understanding in gaze following, but they tend to rely on semantically salient objects and overlook genuine gaze cues, which limits their ability to localize non-salient targets. To address this, the work proposes a head-conditioned local LoRA module combined with an out-of-cone penalty. The former restricts adaptive fine-tuning to the head region, while the latter injects geometrically constrained gaze cues into the head representations, strengthening gaze reasoning without disturbing scene understanding. The method achieves state-of-the-art performance on the GazeFollow and VAT benchmarks, with the largest gains in scenarios where the gaze target is not salient.
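To make the idea of a geometric cone constraint concrete, here is a minimal NumPy sketch of one way a cone penalty could be computed on a predicted heatmap. It is not the paper's formulation, which this page does not give: the function name `out_of_cone_penalty`, the 2D cone, the `half_angle_deg` parameter, and the choice to penalize the fraction of probability mass outside the cone are all assumptions made for illustration.

```python
import numpy as np

def out_of_cone_penalty(heatmap, head_xy, gaze_dir, half_angle_deg=30.0):
    """Fraction of heatmap mass outside a 2D gaze cone (illustrative assumption).

    heatmap:  (H, W) non-negative array with a positive sum.
    head_xy:  (x, y) head position in pixel coordinates.
    gaze_dir: (dx, dy) gaze direction in the image plane (need not be unit length).
    """
    h, w = heatmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Vectors from the head position to every pixel.
    vx = xs - head_xy[0]
    vy = ys - head_xy[1]
    norm = np.hypot(vx, vy)
    d = np.asarray(gaze_dir, dtype=float)
    d = d / np.linalg.norm(d)
    # Cosine of the angle to the gaze direction; the head pixel counts as inside.
    cos = np.where(norm > 0, (vx * d[0] + vy * d[1]) / np.maximum(norm, 1e-12), 1.0)
    outside = cos < np.cos(np.deg2rad(half_angle_deg))
    prob = heatmap / heatmap.sum()
    return float((prob * outside).sum())

# Usage: a person at (4, 4) looking right, with all mass placed behind them.
hm = np.zeros((8, 8))
hm[4, 0] = 1.0
penalty = out_of_cone_penalty(hm, head_xy=(4, 4), gaze_dir=(1, 0))
```

A penalty of this shape is zero when the prediction lies entirely inside the cone and grows as mass moves outside it, so it could in principle be added to a standard heatmap loss as a regularizer.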

📝 Abstract

Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. However, we observe a key limitation of VFM-based approaches: while VFMs substantially improve scene understanding, they contribute little to gaze reasoning. As a result, existing methods often rely on semantically salient objects rather than true gaze cues, leading to degraded performance when targets are not salient. To address this, we propose a novel training mechanism to enhance gaze reasoning in VFMs for gaze following. Our method includes: (1) a head-conditioned local LoRA, which enables localized adaptation to preserve scene token learning while improving head token learning for gaze reasoning; and (2) an out-of-cone penalty, which injects gaze cues into head tokens while aligning them with scene tokens. Experiments on the GazeFollow and VAT datasets demonstrate that our method achieves state-of-the-art performance, with particularly strong improvements when gaze targets are not semantically salient. Our findings offer valuable insights for advancing future gaze following research. We will release the code once the paper is accepted.
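The localized adaptation described in the abstract can be pictured as a low-rank update that is applied only to head tokens, leaving scene tokens on the frozen path. The sketch below is a hedged illustration of that idea in NumPy and should not be read as the authors' implementation: `local_lora_linear`, the boolean `head_mask`, and the `scale` factor are hypothetical names, and how the paper actually conditions the update on the head is not stated here.

```python
import numpy as np

def local_lora_linear(x, W, A, B, head_mask, scale=1.0):
    """Linear layer with a LoRA update restricted to head tokens (assumed design).

    x:         (n_tokens, d_in) token features.
    W:         (d_in, d_out) frozen pretrained weight.
    A, B:      (d_in, r) and (r, d_out) trainable low-rank factors.
    head_mask: (n_tokens,) boolean, True for tokens inside the head region.
    """
    base = x @ W                      # frozen path, shared by all tokens
    delta = (x @ A) @ B * scale       # low-rank update
    # Only head tokens receive the update; scene tokens stay unchanged.
    return base + head_mask[:, None] * delta

# Usage: 6 tokens, the first two of which fall in the head region.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
W = rng.normal(size=(16, 16))
A = rng.normal(size=(16, 4))
B = rng.normal(size=(4, 16))
mask = np.array([True, True, False, False, False, False])
out = local_lora_linear(x, W, A, B, mask)
```

Masking the update this way is one simple means of keeping the pretrained scene representation intact while letting the head tokens adapt, which matches the motivation the abstract gives for localizing the adaptation.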

Problem

Research questions and friction points this paper is trying to address.

gaze following

gaze reasoning

vision foundation models

semantic saliency

gaze target localization

Innovation

Methods, ideas, or system contributions that make the work stand out.

gaze reasoning

vision foundation models

LoRA