Gaze Beyond the Frame: Forecasting Egocentric 3D Visual Span

📅 2025-11-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses egocentric 3D visual span forecasting: predicting where a person's visual perception will focus next within their three-dimensional environment, from a first-person perspective. The authors propose EgoSpanLift, a method that lifts visual span forecasting from 2D image planes to 3D scene space. EgoSpanLift converts SLAM-derived keypoints into gaze-compatible geometry, extracts volumetric visual span regions, and fuses them spatio-temporally with a 3D U-Net and a unidirectional transformer to predict future fixation regions on a voxel grid. To support this task, the authors curate a large-scale multisensory benchmark with 364.6K samples. Experiments show that EgoSpanLift outperforms competitive baselines on 3D localization and 2D gaze anticipation, and that its 2D projection is comparable to dedicated 2D models without any additional 2D-specific training, pointing toward anticipatory visual perception modeling for AR/VR and assistive systems.
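The lifting step is only described at a high level above. As a rough illustration, a minimal NumPy sketch (the cone half-angle, grid size, and grid extent are made-up parameters, and the keypoints, eye position, and unit gaze direction are assumed inputs, not details from the paper) might select the SLAM keypoints near the gaze ray and voxelize them into an occupancy grid:

```python
import numpy as np

def lift_visual_span(keypoints, eye_pos, gaze_dir,
                     grid_size=32, extent=2.0, half_angle_deg=30.0):
    """Hypothetical sketch, not the paper's implementation: keep SLAM keypoints
    inside a cone around the gaze ray and voxelize them into a binary grid.

    keypoints: (N, 3) world-space 3D points from SLAM
    eye_pos:   (3,) eye/camera position in world coordinates
    gaze_dir:  (3,) unit gaze direction
    """
    rel = keypoints - eye_pos                        # vectors from eye to points
    dist = np.linalg.norm(rel, axis=1) + 1e-8
    cos_angle = (rel @ gaze_dir) / dist              # cosine of angle to gaze ray
    in_cone = cos_angle > np.cos(np.deg2rad(half_angle_deg))
    span_pts = keypoints[in_cone]

    # Voxelize into a cube of side 2 * extent centered on the eye position.
    grid = np.zeros((grid_size,) * 3, dtype=np.float32)
    idx = ((span_pts - (eye_pos - extent)) / (2 * extent) * grid_size).astype(int)
    idx = idx[np.all((idx >= 0) & (idx < grid_size), axis=1)]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid
```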

📝 Abstract
People continuously perceive and interact with their surroundings based on underlying intentions that drive their exploration and behaviors. While research in egocentric user and scene understanding has focused primarily on motion and contact-based interaction, forecasting human visual perception itself remains less explored despite its fundamental role in guiding human actions and its implications for AR/VR and assistive technologies. We address the challenge of egocentric 3D visual span forecasting, predicting where a person's visual perception will focus next within their three-dimensional environment. To this end, we propose EgoSpanLift, a novel method that transforms egocentric visual span forecasting from 2D image planes to 3D scenes. EgoSpanLift converts SLAM-derived keypoints into gaze-compatible geometry and extracts volumetric visual span regions. We further combine EgoSpanLift with 3D U-Net and unidirectional transformers, enabling spatio-temporal fusion to efficiently predict future visual span in the 3D grid. In addition, we curate a comprehensive benchmark from raw egocentric multisensory data, creating a testbed with 364.6K samples for 3D visual span forecasting. Our approach outperforms competitive baselines for egocentric 2D gaze anticipation and 3D localization while achieving comparable results even when projected back onto 2D image planes without additional 2D-specific training.
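The abstract names the fusion components (a 3D U-Net and unidirectional transformers) but not their configuration. A minimal PyTorch sketch of that style of spatio-temporal fusion, with a small 3D CNN standing in for the full 3D U-Net and all layer sizes chosen arbitrarily, could look like:

```python
import torch
import torch.nn as nn

class SpanForecaster(nn.Module):
    """Hypothetical sketch, not the authors' architecture: encode each frame's
    voxel grid with a small 3D CNN, fuse the per-frame features over time with
    a causal (unidirectional) Transformer, and decode the last hidden state
    into a future visual-span voxel grid."""

    def __init__(self, grid=32, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.grid = grid
        self.encoder = nn.Sequential(                    # input: (B*T, 1, G, G, G)
            nn.Conv3d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(4), nn.Flatten(),       # -> (B*T, 32 * 4**3)
            nn.Linear(32 * 4 ** 3, d_model),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)
        self.decoder = nn.Linear(d_model, grid ** 3)     # per-voxel logits

    def forward(self, x):                                # x: (B, T, G, G, G)
        B, T = x.shape[:2]
        feats = self.encoder(x.reshape(B * T, 1, *x.shape[2:]))
        feats = feats.view(B, T, -1)
        # Causal mask: each step may only attend to past observations.
        causal = torch.triu(torch.full((T, T), float("-inf"), device=x.device),
                            diagonal=1)
        h = self.temporal(feats, mask=causal)            # unidirectional fusion
        logits = self.decoder(h[:, -1])                  # forecast from last step
        return logits.view(B, self.grid, self.grid, self.grid)

# Usage: voxelized spans for 8 observed frames -> predicted future span grid.
model = SpanForecaster()
past = torch.rand(2, 8, 32, 32, 32)
pred = torch.sigmoid(model(past))                        # (2, 32, 32, 32)
```

The causal attention mask is what makes the temporal fusion unidirectional: each time step can only attend to earlier observations, which matches the forecasting setting.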
Problem

Research questions and friction points this paper is trying to address.

Forecasting where a person's visual perception will focus next in a 3D environment
Lifting egocentric visual span prediction from 2D image planes to 3D scenes
Building a benchmark for 3D visual span forecasting from raw multisensory data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lifts gaze forecasting from 2D image planes to 3D scenes
Converts SLAM-derived keypoints into gaze-compatible geometry and volumetric visual span regions
Combines a 3D U-Net with unidirectional transformers for spatio-temporal fusion
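The abstract also notes that the 3D forecast can be projected back onto 2D image planes for comparison with 2D gaze-anticipation baselines. A minimal sketch of such a projection (the grid layout, pinhole camera model, and every parameter here are assumptions, not details from the paper) might splat occupied voxel centers through the camera:

```python
import numpy as np

def project_span_to_image(grid, origin, voxel_size, K, R, t, img_hw):
    """Hypothetical sketch, not the paper's evaluation code: project occupied
    voxel centers of a predicted 3D span grid onto the image plane.

    grid:       (G, G, G) binary/probability voxel grid in world coordinates
    origin:     (3,) world position of the grid's corner voxel (0, 0, 0)
    voxel_size: edge length of one voxel in meters
    K:          (3, 3) camera intrinsics
    R, t:       world-to-camera rotation (3, 3) and translation (3,)
    """
    H, W = img_hw
    heat = np.zeros((H, W), dtype=np.float32)
    occ = np.argwhere(grid > 0.5)                        # occupied voxel indices
    centers = origin + (occ + 0.5) * voxel_size          # world-space voxel centers
    cam = centers @ R.T + t                              # world -> camera frame
    cam = cam[cam[:, 2] > 0]                             # keep points in front
    uvw = cam @ K.T                                      # pinhole projection
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    heat[uv[valid, 1], uv[valid, 0]] = 1.0
    return heat
```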
🔎 Similar Papers
No similar papers found.