Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video Transformers struggle to model complete spatiotemporal dependencies in critical regions and long-range action dynamics due to their reliance on factorized or windowed self-attention mechanisms. Inspired by the human visual system’s “glance-and-gaze” strategy—characterized by an initial holistic glance followed by focused gaze—this work proposes the OG-ReG Transformer. The architecture incorporates a Glance pathway to capture global, coarse-grained spatiotemporal context and a Gaze pathway to attend to fine-grained local details, dynamically allocating sparse spatiotemporal attention and fusing multi-scale features. This approach introduces, for the first time, a dual-path visual attention mechanism into video understanding, moving beyond conventional uniform processing paradigms. It achieves state-of-the-art performance across multiple benchmarks, including Kinetics-400, Something-Something v2, and Diving-48.
📝 Abstract
Recently, Transformers have made significant progress in various vision tasks. To balance accuracy and computational cost in video tasks, recent works rely heavily on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. Is equal consideration of time and space crucial for success in video tasks? We argue that, similar to the human visual system, the importance of temporal and spatial information varies across time scales, and attention is allocated sparsely over time through glance and gaze behavior. Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements it with local details. Our model achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48, demonstrating its effectiveness. The code will be available at https://github.com/linuxsino/OG-ReG.
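The dual-path idea from the abstract can be sketched in a few lines: a Glance path pools tokens into a coarse grid and attends globally, while a Gaze path applies sparse attention only to a handful of salient tokens, and the two outputs are fused. This is a minimal NumPy illustration under assumed shapes and a hypothetical norm-based saliency score, not the paper's actual implementation (which uses learned attention weights and multi-scale fusion).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def glance_gaze(tokens, pool=4, top_k=8):
    """Toy glance/gaze step over flattened spatiotemporal tokens (N, d)."""
    n, d = tokens.shape
    # Glance path: average-pool groups of tokens, attend over the coarse grid
    m = n - n % pool
    coarse = tokens[:m].reshape(-1, pool, d).mean(axis=1)
    glance_out = attention(coarse, coarse, coarse)          # (m // pool, d)
    # broadcast each coarse context vector back to its group of tokens
    glance_ctx = np.repeat(glance_out, pool, axis=0)
    glance_ctx = np.vstack([glance_ctx, np.zeros((n - m, d))])
    # Gaze path: sparse attention among the top_k most "salient" tokens
    # (token norm is a stand-in saliency; the real model learns this)
    saliency = np.linalg.norm(tokens, axis=1)
    idx = np.argsort(-saliency)[:top_k]
    gaze_out = np.zeros_like(tokens)
    gaze_out[idx] = attention(tokens[idx], tokens[idx], tokens[idx])
    # fuse the two paths (simple residual sum here)
    return tokens + glance_ctx + gaze_out
```

The key property the sketch preserves is cost: the glance attention is quadratic only in the pooled grid size, and the gaze attention is quadratic only in `top_k`, so neither path pays the full N×N attention cost.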
Problem

Research questions and friction points this paper is trying to address.

video understanding
spatiotemporal correlation
long-range dependencies
self-attention
motion modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Glance-Gaze Mechanism
Dual-path Transformer
Spatiotemporal Attention
Video Action Recognition
Human Visual Cognition