Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video Transformers struggle to model complete spatiotemporal dependencies in critical regions and long-range action dynamics due to their reliance on factorized or windowed self-attention mechanisms. Inspired by the human visual system’s “glance-and-gaze” strategy—characterized by an initial holistic glance followed by focused gaze—this work proposes the OG-ReG Transformer. The architecture incorporates a Glance pathway to capture global, coarse-grained spatiotemporal context and a Gaze pathway to attend to fine-grained local details, dynamically allocating sparse spatiotemporal attention and fusing multi-scale features. This approach introduces, for the first time, a dual-path visual attention mechanism into video understanding, moving beyond conventional uniform processing paradigms. It achieves state-of-the-art performance across multiple benchmarks, including Kinetics-400, Something-Something v2, and Diving-48.
📝 Abstract
Recently, Transformers have made significant progress in various vision tasks. To balance accuracy and computational cost in video tasks, recent works rely heavily on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. Is equal consideration of time and space crucial for success in video tasks? We argue that, similar to the human visual system, the importance of temporal and spatial information varies across time scales, and attention is allocated sparsely over time through glance and gaze behavior. Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements it with local details. Our model achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48, demonstrating its effectiveness. The code will be available at https://github.com/linuxsino/OG-ReG.
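The dual-path idea from the abstract can be sketched in a few lines: a Glance path pools tokens into a coarse grid and attends globally, while a Gaze path applies sparse attention only to a handful of salient tokens, and the two outputs are fused. This is a minimal NumPy illustration under assumed shapes and a hypothetical norm-based saliency score, not the paper's actual implementation (which uses learned attention weights and multi-scale fusion).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def glance_gaze(tokens, pool=4, top_k=8):
    """Toy glance/gaze step over flattened spatiotemporal tokens (N, d)."""
    n, d = tokens.shape
    # Glance path: average-pool groups of tokens, attend over the coarse grid
    m = n - n % pool
    coarse = tokens[:m].reshape(-1, pool, d).mean(axis=1)
    glance_out = attention(coarse, coarse, coarse)          # (m // pool, d)
    # broadcast each coarse context vector back to its group of tokens
    glance_ctx = np.repeat(glance_out, pool, axis=0)
    glance_ctx = np.vstack([glance_ctx, np.zeros((n - m, d))])
    # Gaze path: sparse attention among the top_k most "salient" tokens
    # (token norm is a stand-in saliency; the real model learns this)
    saliency = np.linalg.norm(tokens, axis=1)
    idx = np.argsort(-saliency)[:top_k]
    gaze_out = np.zeros_like(tokens)
    gaze_out[idx] = attention(tokens[idx], tokens[idx], tokens[idx])
    # fuse the two paths (simple residual sum here)
    return tokens + glance_ctx + gaze_out
```

The key property the sketch preserves is cost: the glance attention is quadratic only in the pooled grid size, and the gaze attention is quadratic only in `top_k`, so neither path pays the full N×N attention cost.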
Problem

Research questions and friction points this paper is trying to address.

video understanding
spatiotemporal correlation
long-range dependencies
self-attention
motion modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Glance-Gaze Mechanism
Dual-path Transformer
Spatiotemporal Attention
Video Action Recognition
Human Visual Cognition