GazeMoE: Perception of Gaze Target with Mixture-of-Experts

📅 2026-03-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the problem of estimating human gaze targets from visible images and proposes GazeMoE, a framework designed to improve generalization and robustness by leveraging multimodal cues. Built on a frozen vision foundation model, GazeMoE introduces a mixture-of-experts (MoE) mechanism, applied for the first time to gaze target estimation, to adaptively fuse multimodal signals such as eye appearance, head pose, gestures, and scene context. The approach further combines region-cropping and photometric data augmentation with a class-balanced auxiliary loss to mitigate the imbalance between in-frame and out-of-frame gaze targets. Extensive experiments demonstrate that GazeMoE achieves state-of-the-art performance across multiple benchmark datasets, with particularly pronounced gains in complex real-world scenes.
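The summary describes an MoE layer fusing multimodal cues on top of a frozen backbone, but the page carries no implementation details. Purely as a minimal PyTorch sketch of top-k token routing, where `MoEFusion`, the expert MLP shape, and all hyperparameters are illustrative assumptions rather than the authors' actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFusion(nn.Module):
    """Minimal top-k Mixture-of-Experts fusion layer (illustrative only).

    Each token (e.g., a frozen-backbone patch feature carrying eye, head,
    gesture, or scene cues) is routed by a gating network to its top-k
    experts; expert outputs are mixed by the softmaxed gate weights.
    """

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim)
        weights, idx = self.gate(tokens).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e  # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(tokens[mask])
        return out

# Hypothetical usage: fuse ViT-style patch tokens of width 768
fused = MoEFusion(dim=768)(torch.randn(2, 196, 768))
```

Production MoE layers usually add a load-balancing term so experts are used evenly; that is omitted here for brevity.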

📝 Abstract
Estimating the human gaze target from visible images is a critical task for robots to understand human attention, yet the development of generalizable neural architectures and training paradigms remains challenging. While recent advances in pre-trained vision foundation models offer promising avenues for locating gaze targets, the integration of multi-modal cues, including eyes, head poses, gestures, and contextual features, demands adaptive and efficient decoding mechanisms. Inspired by Mixture-of-Experts (MoE) for adaptive domain expertise in large vision-language models, we propose GazeMoE, a novel end-to-end framework that selectively leverages gaze-target-related cues from a frozen foundation model through MoE modules. To address class imbalance in gaze target classification (in-frame vs. out-of-frame) and enhance robustness, GazeMoE incorporates a class-balancing auxiliary loss alongside strategic data augmentations, including region-specific cropping and photometric transformations. Extensive experiments on benchmark datasets demonstrate that GazeMoE achieves state-of-the-art performance, outperforming existing methods on challenging gaze estimation tasks. The code and pre-trained models are released at https://huggingface.co/zdai257/GazeMoE
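The abstract names two training-side ingredients: a class-balancing auxiliary loss for the in-frame vs. out-of-frame decision, and region-specific cropping plus photometric augmentation. The following is a rough sketch under assumed details; the inverse-frequency weighting and all transform parameters are guesses for illustration, not the paper's formulation:

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Region cropping + photometric jitter of the general kind the abstract
# mentions; crop scale and jitter strengths are illustrative guesses.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),                   # region cropping
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # photometric
    transforms.ToTensor(),
])

def class_balanced_bce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Auxiliary in-frame (1) vs. out-of-frame (0) loss, weighting each
    sample by the inverse frequency of its class within the batch so the
    rare class is not drowned out. The paper's exact scheme may differ."""
    targets = targets.float()
    pos = targets.sum().clamp(min=1.0)
    neg = (targets.numel() - pos).clamp(min=1.0)
    w = torch.where(targets > 0.5,
                    targets.numel() / (2.0 * pos),
                    targets.numel() / (2.0 * neg))
    return F.binary_cross_entropy_with_logits(logits, targets, weight=w)
```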
Problem

Research questions and friction points this paper is trying to address.

gaze target estimation
multi-modal cues
class imbalance
generalizable neural architectures
human attention understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
gaze target estimation
foundation model
class-balanced loss
multi-modal cues
Zhuangzhuang Dai
Aston University
Embedded Systems, Machine Learning, Computer Vision, SLAM, Navigation
Zhongxi Lu
Computing Science, University of Leicester, Leicester, United Kingdom
Vincent G. Zakka
Dept. of Applied AI and Robotics, Aston University, Birmingham, United Kingdom
Luis J. Manso
Senior Lecturer (Associate Professor) in Computer Science, Aston University, UK
autonomous robotics, active perception, social navigation, human-robot interaction
Jose M. Alcaraz Calero
Dept. of Applied AI and Robotics, Aston University, Birmingham, United Kingdom
Chen Li
Dept. of Materials and Production, Aalborg University, Aalborg, Denmark