GazeNLQ @ Ego4D Natural Language Queries Challenge 2025

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses natural language query (NLQ) localization in first-person videos by leveraging wearer gaze as a cognitive prior to enhance video representation learning. We propose the first end-to-end integration of gaze estimation into an NLQ framework: (1) a contrastive learning-based pretraining strategy for video gaze modeling; (2) a gaze-aware video–language temporal alignment mechanism; and (3) deep fusion of gaze features with multimodal temporal modeling. Evaluated on the Ego4D NLQ benchmark, our method achieves R1@IoU0.3 = 27.82 and R1@IoU0.5 = 18.68—substantially outperforming gaze-agnostic baselines. The implementation is publicly available.
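The reported R1@IoU0.3 and R1@IoU0.5 scores follow the standard NLQ evaluation: the fraction of queries whose top-1 predicted segment overlaps the ground-truth segment with temporal IoU at or above the threshold. A minimal sketch of that metric (function names here are illustrative, not from the paper's code):

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two temporal segments (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def r1_at_iou(top1_predictions, ground_truths, threshold):
    """R1@IoU: percentage of queries whose top-1 prediction reaches the
    IoU threshold against the ground-truth segment."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(top1_predictions, ground_truths))
    return 100.0 * hits / len(ground_truths)
```

So a score of 27.82 at IoU 0.3 means the top-1 segment clears a 0.3 temporal overlap for 27.82% of queries.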

📝 Abstract
This report presents our solution to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2025. Egocentric video captures the scene from the wearer's perspective, where gaze serves as a key non-verbal communication cue that reflects visual attention and offers insights into human intention and cognition. Motivated by this, we propose a novel approach, GazeNLQ, which leverages gaze to retrieve video segments that match given natural language queries. Specifically, we introduce a contrastive learning-based pretraining strategy for gaze estimation directly from video. The estimated gaze is used to augment video representations within the proposed model, thereby enhancing localization accuracy. Experimental results show that GazeNLQ achieves R1@IoU0.3 and R1@IoU0.5 scores of 27.82 and 18.68, respectively. Our code is available at https://github.com/stevenlin510/GazeNLQ.
Problem

Research questions and friction points this paper is trying to address.

Leveraging gaze to match video segments with natural language queries
Improving video localization accuracy using gaze-augmented representations
Developing contrastive learning-based pretraining for gaze estimation from video
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages gaze for video segment retrieval
Contrastive learning for gaze estimation
Augments video representations with gaze
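The contrastive pretraining idea above can be sketched with a generic InfoNCE objective over a batch of paired embeddings (e.g. video clips and their gaze targets), where matched pairs sit on the diagonal of a similarity matrix. This is a standard formulation; the paper's actual loss, encoders, and temperature are assumptions here:

```python
import math

def info_nce(sim_matrix, temperature=0.07):
    """Generic InfoNCE loss: rows are anchors, columns are candidates,
    and sim_matrix[i][i] holds the positive (matched) pair's similarity."""
    n = len(sim_matrix)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim_matrix[i]]
        m = max(logits)  # subtract max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denom)  # cross-entropy toward the diagonal
    return total / n
```

Training pushes each anchor's similarity to its own positive above all in-batch negatives; a lower loss than log(batch_size) indicates the model does better than chance.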
Wei-Cheng Lin
PhD student of Electrical & Computer Engineering, University of Texas at Dallas
affective computing · speech signal processing · machine learning · multimodal signal processing
Chih-Ming Lien
National Taiwan Normal University
Chen Lo
National Taiwan Normal University
Chia-Hung Yeh
National Taiwan Normal University, National Sun Yat-sen University