Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

📅 2024-12-12
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses gaze target localization in natural scenes. We propose a unified framework that departs from conventional pipelines built on multiple encoders and hand-crafted fusion: (1) a frozen DINOv2 encoder serves as a shared visual backbone that extracts a single scene-level feature representation; (2) a learnable positional prompt encodes the location of the person whose gaze is being estimated; and (3) a lightweight attention-based decoder predicts the gaze target from the prompted features. To our knowledge, this is the first approach to combine a frozen vision foundation model with a positional prompting mechanism for gaze target estimation. Our method achieves state-of-the-art performance across several gaze target estimation benchmarks, while improving cross-scene generalization and inference efficiency (2.1× faster than baseline methods). The source code is publicly available.

📝 Abstract
We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning both about the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at http://github.com/fkryan/gazelle.
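
The abstract describes applying a person-specific positional prompt to a single scene feature representation, without giving details. The following is a minimal sketch of one way such a prompt could work, assuming the prompt is a vector added to every patch token that overlaps the person's head bounding box. The names `head_mask` and `apply_position_prompt`, the 4×4 patch grid, and the zero-valued tokens are all hypothetical stand-ins, not the authors' implementation.

```python
def head_mask(bbox, grid_h, grid_w):
    """Binary mask over the patch grid: 1.0 where a patch overlaps the head box.

    bbox is (x_min, y_min, x_max, y_max) in normalized [0, 1] image coordinates.
    """
    x_min, y_min, x_max, y_max = bbox
    mask = []
    for row in range(grid_h):
        # Vertical extent of this row of patches, in normalized coordinates.
        top, bottom = row / grid_h, (row + 1) / grid_h
        mask_row = []
        for col in range(grid_w):
            left, right = col / grid_w, (col + 1) / grid_w
            overlaps = left < x_max and right > x_min and top < y_max and bottom > y_min
            mask_row.append(1.0 if overlaps else 0.0)
        mask.append(mask_row)
    return mask


def apply_position_prompt(tokens, mask, prompt):
    """Add the prompt vector to every token whose patch lies inside the head mask.

    tokens: grid_h x grid_w x dim nested lists (stand-in for frozen encoder output).
    prompt: length-dim list (stand-in for a learned embedding; an assumption).
    """
    return [
        [
            [t + m * p for t, p in zip(token, prompt)]
            for token, m in zip(token_row, mask_row)
        ]
        for token_row, mask_row in zip(tokens, mask)
    ]


# Toy example: a 4x4 grid of 3-dimensional tokens, head in the top-left of the image.
grid_h, grid_w, dim = 4, 4, 3
tokens = [[[0.0] * dim for _ in range(grid_w)] for _ in range(grid_h)]
prompt = [0.5, -0.5, 1.0]
mask = head_mask((0.0, 0.0, 0.5, 0.25), grid_h, grid_w)
prompted = apply_position_prompt(tokens, mask, prompt)
```

Under this assumption, the scene features are computed once per image, and only the mask changes from person to person, so the same encoder output can be reused for every person in the scene.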

Problem

Research questions and friction points this paper is trying to address.

Predicting gaze targets in complex scenes
Streamlining gaze target estimation with a single transformer framework
Replacing separate scene, head, and auxiliary encoders with features from a frozen DINOv2 encoder

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses frozen DINOv2 encoder for feature extraction
Applies person-specific positional prompts
Lightweight module for gaze decoding
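
To illustrate the last step, here is a minimal sketch of turning a decoder output into a gaze point. It assumes the lightweight module produces a 2D heatmap of scores over the scene, which this page does not confirm; the function name `heatmap_to_point` and the toy heatmap values are hypothetical.

```python
def heatmap_to_point(heatmap):
    """Return the (x, y) center of the highest-scoring cell, normalized to [0, 1]."""
    height, width = len(heatmap), len(heatmap[0])
    # Search every (row, col) cell and keep the one with the largest score.
    best_row, best_col = max(
        ((r, c) for r in range(height) for c in range(width)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    # Use the cell center so the result does not sit on a cell edge.
    return ((best_col + 0.5) / width, (best_row + 0.5) / height)


# Toy 4x4 heatmap (assumed decoder output) with its peak at row 1, column 2.
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.1],
    [0.0, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
point = heatmap_to_point(heatmap)
```

Here `point` is an (x, y) pair in normalized image coordinates, which can be scaled by the image width and height to get pixel coordinates.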

Fiona Ryan
Georgia Institute of Technology
artificial intelligence, computer vision

Ajay Bati
Georgia Institute of Technology

Sangmin Lee
University of Illinois Urbana-Champaign

Daniel Bolya
Meta, FAIR
Computer Vision and Machine Learning and Artificial Intelligence

Judy Hoffman
Georgia Institute of Technology

J. Rehg
University of Illinois Urbana-Champaign