StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

📅 2025-12-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing streaming video understanding models do not explicitly model or exploit human eye-movement signals, and gaze-guided temporal reasoning and proactive understanding have not been systematically evaluated. Method: We introduce StreamGaze, the first gaze-guided benchmark for streaming video understanding, pairing first-person videos with authentic eye-tracking trajectories. The framework establishes a three-tier evaluation protocol covering retrospective reasoning, concurrent perception, and proactive prediction, and generates spatio-temporally aligned gaze-video question-answering data via fixation extraction, region-specific visual prompting, and scanpath construction. Contribution/Results: Experiments show that state-of-the-art multimodal large language models significantly underperform humans in leveraging gaze cues, modeling user intent, and performing proactive inference, exposing their passive perceptual nature. This work contributes a new benchmark, methodology, and conceptual perspective for active, gaze-driven video understanding.
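The pipeline described above starts from raw gaze trajectories, from which fixations are extracted before any prompting or QA generation. Below is a minimal sketch of a dispersion-threshold (I-DT style) fixation detector; the input format ((t, x, y) samples in normalized frame coordinates), the thresholds, and the algorithm choice are illustrative assumptions, not details taken from the paper.

```python
# Minimal dispersion-threshold (I-DT style) fixation extraction sketch.
# Assumptions (not from the paper): gaze samples are (t_sec, x, y) tuples in
# normalized [0, 1] frame coordinates; thresholds are illustrative defaults.

def extract_fixations(samples, max_dispersion=0.03, min_duration=0.1):
    """Group consecutive gaze samples into fixations.

    A window of samples counts as a fixation if its spatial dispersion,
    (max(x) - min(x)) + (max(y) - min(y)), stays under `max_dispersion`
    for at least `min_duration` seconds.
    """
    fixations = []
    i = 0
    while i < len(samples):
        j = i
        # Grow the window while its dispersion stays below the threshold.
        while j + 1 < len(samples):
            xs = [s[1] for s in samples[i:j + 2]]
            ys = [s[2] for s in samples[i:j + 2]]
            if (max(xs) - min(xs)) + (max(ys) - min(ys)) > max_dispersion:
                break
            j += 1
        duration = samples[j][0] - samples[i][0]
        if j > i and duration >= min_duration:
            window = samples[i:j + 1]
            fixations.append({
                "start": samples[i][0],
                "end": samples[j][0],
                "x": sum(s[1] for s in window) / len(window),
                "y": sum(s[2] for s in window) / len(window),
            })
            i = j + 1
        else:
            i += 1
    return fixations

# Example: a 30 Hz gaze stream with one stable fixation followed by a saccade.
gaze = [(k / 30.0, 0.50 + 0.002 * k, 0.40) for k in range(10)] \
     + [((10 + k) / 30.0, 0.80, 0.70) for k in range(3)]
print(extract_fixations(gaze))  # one fixation around (0.51, 0.40)
```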

📝 Abstract
Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications like AR glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether MLLMs can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs use gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively evaluate streaming video understanding. These tasks assess whether models can use real-time gaze to follow shifting attention and infer user intentions from only past and currently observed frames. To build StreamGaze, we develop a gaze-video QA generation pipeline that aligns egocentric videos with raw gaze trajectories via fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that closely reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, revealing fundamental limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze-prompting strategies, reasoning behaviors, and task-specific failure modes, offering deeper insight into why current MLLMs struggle and what capabilities future models must develop. All data and code will be publicly released to support continued research in gaze-guided streaming video understanding.
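The abstract's pipeline also constructs scanpaths from the aligned gaze data. The sketch below shows one possible textual representation, assuming the fixation dictionaries produced by the sketch above and a coarse 3x3 grid labelling; the paper's actual scanpath format is not specified here.

```python
# Sketch of scanpath construction (an assumed textual representation, not the
# paper's exact format): order fixations in time and map each one to a coarse
# 3x3 grid label so a QA-generation prompt can reference where the user looked.

def build_scanpath(fixations):
    """Serialize fixation dicts (from the sketch above) into a scanpath string."""
    rows = ["top", "middle", "bottom"]
    cols = ["left", "center", "right"]
    steps = []
    for f in sorted(fixations, key=lambda f: f["start"]):
        r = min(int(f["y"] * 3), 2)
        c = min(int(f["x"] * 3), 2)
        label = "center" if (r, c) == (1, 1) else f"{rows[r]}-{cols[c]}"
        steps.append(f"[{f['start']:.1f}s] {label}")
    return " -> ".join(steps)

# Example: a fixation near the image center, then one at the lower right.
fixs = [{"start": 0.0, "end": 0.3, "x": 0.51, "y": 0.40},
        {"start": 0.5, "end": 0.9, "x": 0.85, "y": 0.80}]
print(build_scanpath(fixs))  # "[0.0s] center -> [0.5s] bottom-right"
```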
Problem

Research questions and friction points this paper is trying to address.

Evaluates MLLMs' use of gaze for temporal reasoning in streaming videos
Assesses models' ability to infer user intentions from real-time gaze signals
Measures proactive understanding and gaze-guided prediction in dynamic video contexts (a sketch of the streaming visibility constraint follows this list)
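The past/present/proactive split rests on a shared streaming constraint: at question time the model may condition only on the frames and gaze samples observed so far. The sketch below makes that rule explicit; the names (StreamingQuery, visible_context) and fields are hypothetical, not taken from the paper.

```python
# Illustrative sketch of the streaming constraint; the names (StreamingQuery,
# visible_context) and fields are hypothetical, not taken from the paper.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class StreamingQuery:
    t_query: float   # timestamp at which the question is asked
    task: str        # "past", "present", or "proactive"
    question: str

def visible_context(frames: List[Tuple[float, object]],
                    gaze: List[Tuple[float, float, float]],
                    query: StreamingQuery):
    """Return only the frames and gaze samples the model may condition on."""
    seen_frames = [(t, f) for t, f in frames if t <= query.t_query]
    seen_gaze = [(t, x, y) for t, x, y in gaze if t <= query.t_query]
    # "past" asks about earlier events, "present" about the current moment,
    # and "proactive" asks what comes next; in all three cases the model
    # receives no future frames or gaze.
    return seen_frames, seen_gaze
```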
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaze-guided temporal reasoning benchmark for streaming videos
Gaze-video QA generation pipeline with fixation extraction, region-specific visual prompting, and scanpath construction (a prompting sketch follows this list)
Evaluation of proactive understanding using real-time gaze signals
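Region-specific visual prompting presumably marks the fixated region on a frame before it is passed to the MLLM. The sketch below shows one plausible realization using Pillow; the marker style, the draw_gaze_prompt helper, and its parameters are assumptions, not the paper's exact recipe.

```python
# One plausible realization of region-specific visual prompting (assumed, not
# the paper's exact recipe): draw a marker around the fixated region of a frame
# before handing it to the MLLM. Requires Pillow.

from PIL import Image, ImageDraw

def draw_gaze_prompt(frame: Image.Image, fx: float, fy: float,
                     radius_frac: float = 0.08) -> Image.Image:
    """Overlay a circle centered on the fixation (fx, fy in [0, 1] coordinates)."""
    out = frame.convert("RGB")
    draw = ImageDraw.Draw(out)
    w, h = out.size
    cx, cy = fx * w, fy * h
    r = radius_frac * min(w, h)
    draw.ellipse([cx - r, cy - r, cx + r, cy + r], outline=(255, 0, 0), width=4)
    return out

# Usage (hypothetical file names): mark the fixation found by the sketches above.
# frame = Image.open("frame_000123.jpg")
# prompted = draw_gaze_prompt(frame, fx=0.51, fy=0.40)
# prompted.save("frame_000123_gaze_prompt.jpg")
```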