ObjectVisA-120: Object-based Visual Attention Prediction in Interactive Street-crossing Environments

📅 2026-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of effectively modeling object-based visual attention in street scenes, a task hindered by the absence of suitable interactive datasets and evaluation metrics. To bridge this gap, the authors introduce the first virtual reality eye-tracking dataset for object-based attention during street crossing, collected from 120 participants and annotated with high-precision gaze data, panoptic segmentation, depth maps, and vehicle keypoints. They further propose oSIM, a novel metric to quantify object-level attention, and present SUMGraph, a model that explicitly captures attention to salient objects by integrating graph neural networks with a Mamba U-Net architecture. Experiments demonstrate that the proposed approach significantly outperforms existing baselines on oSIM and achieves state-of-the-art performance on general attention prediction benchmarks, confirming its effectiveness and generalization capability.

📝 Abstract
The object-based nature of human visual attention is well-known in cognitive science, but has only played a minor role in computational visual attention models so far. This is mainly due to a lack of suitable datasets and evaluation metrics for object-based attention. To address these limitations, we present ObjectVisA-120 -- a novel 120-participant dataset of spatial street-crossing navigation in virtual reality specifically geared to object-based attention evaluations. A key value of the dataset lies in the ethical and safety-related challenges that make collecting comparable data in real-world environments highly difficult. ObjectVisA-120 not only features accurate gaze data and a complete state-space representation of objects in the virtual environment, but it also offers variable scenario complexities and rich annotations, including panoptic segmentation, depth information, and vehicle keypoints. We further propose object-based similarity (oSIM) as a novel metric to evaluate the performance of object-based visual attention models, a previously unexplored performance characteristic. Our evaluations show that explicitly optimising for object-based attention not only improves oSIM performance but also leads to improved model performance on common metrics. In addition, we present SUMGraph, a Mamba U-Net-based model, which explicitly encodes critical scene objects (vehicles) in a graph representation, leading to further performance improvements over several state-of-the-art visual attention prediction methods. The dataset, code and models will be publicly released.
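The abstract does not give oSIM's exact formulation. As an illustration only, an object-based variant of the standard histogram-intersection similarity (SIM) used in saliency evaluation could restrict the comparison of predicted and ground-truth attention maps to annotated object masks and aggregate per-object scores. The function names, the per-object restriction, and the attention-mass weighting below are assumptions, not the paper's definition.

```python
import numpy as np

def sim(p, q):
    """Histogram intersection (SIM): sum of element-wise minima of two
    distributions, each normalised to sum to 1."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())

def osim(pred, gt, object_masks):
    """Illustrative object-based similarity (NOT the paper's oSIM):
    score SIM separately on each annotated object region, then average
    the per-object scores weighted by ground-truth attention mass.
    pred, gt: 2-D attention maps; object_masks: list of boolean masks."""
    scores, weights = [], []
    for mask in object_masks:
        gt_mass = gt[mask].sum()
        # Skip objects that received no attention in either map.
        if gt_mass == 0 or pred[mask].sum() == 0:
            continue
        scores.append(sim(pred[mask], gt[mask]))
        weights.append(gt_mass)
    if not weights:
        return 0.0
    return float(np.average(scores, weights=weights))
```

A perfect prediction yields a score of 1 on every attended object; attention spread off the annotated objects is simply ignored by this sketch, which is one plausible way a metric could isolate object-level agreement from background saliency.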
Problem

Research questions and friction points this paper is trying to address.

object-based visual attention
attention prediction
dataset
evaluation metrics
street-crossing environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

object-based visual attention
virtual reality dataset
oSIM metric
graph representation
Mamba U-Net
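SUMGraph's graph encoding of vehicles is described here only at a high level. A minimal sketch of the general idea, building a proximity graph over vehicle positions and applying one mean-aggregation message-passing step, might look as follows; the distance-threshold edge rule, the feature choice, and all names are assumptions rather than the paper's architecture.

```python
import numpy as np

def build_vehicle_graph(centroids, radius):
    """Illustrative scene-graph construction: connect vehicles whose
    centroids lie within `radius` of each other (assumed edge rule)."""
    n = len(centroids)
    # Pairwise Euclidean distances between all vehicle centroids.
    dists = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    # Adjacency: within radius, excluding self-loops.
    return (dists < radius) & ~np.eye(n, dtype=bool)

def gnn_layer(features, adj):
    """One mean-aggregation message-passing step: each node's new
    feature is the average over itself and its neighbours."""
    out = np.empty_like(features, dtype=float)
    for i in range(len(features)):
        nbrs = np.flatnonzero(adj[i])
        group = np.vstack([features[i], features[nbrs]]) if nbrs.size else features[i:i + 1]
        out[i] = group.mean(axis=0)
    return out
```

In a full model, node features would plausibly be derived from the dataset's vehicle keypoints and depth annotations, and the resulting node embeddings fused into the dense prediction backbone; this sketch only shows the graph-side mechanics.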
Igor Vozniak
Researcher / PhD candidate
AI
Philipp Mueller
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany; Max Planck Institute for Intelligent Systems, 70569 Stuttgart, Germany
Nils Lipp
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany
J. Sprenger
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany
Konstantin Poddubnyy
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany
Davit Hovhannisyan
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany
Christian Mueller
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany
Andreas Bulling
Professor of Computer Science, University of Stuttgart
Human-Computer Interaction, Computer Vision, Machine Learning, Collaborative AI, Eye Tracking
Philipp Slusallek
Professor for Computer Graphics, Saarland University & DFKI, Saarland Informatics Campus
Visual Computing, Computer Graphics, Artificial Intelligence & Machine Learning, High-Performance Computing