ObjectVisA-120: Object-based Visual Attention Prediction in Interactive Street-crossing Environments

📅 2026-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of effectively modeling object-based visual attention in street scenes, a task hindered by the absence of suitable interactive datasets and evaluation metrics. To bridge this gap, the authors introduce the first virtual reality eye-tracking dataset for object-based attention during street crossing, collected from 120 participants and annotated with high-precision gaze data, panoptic segmentation, depth maps, and vehicle keypoints. They further propose oSIM, a novel metric to quantify object-level attention, and present SUMGraph, a model that explicitly captures attention to salient objects by integrating graph neural networks with a Mamba U-Net architecture. Experiments demonstrate that the proposed approach significantly outperforms existing baselines on oSIM and achieves state-of-the-art performance on general attention prediction benchmarks, confirming its effectiveness and generalization capability.

📝 Abstract
The object-based nature of human visual attention is well-known in cognitive science, but has only played a minor role in computational visual attention models so far. This is mainly due to a lack of suitable datasets and evaluation metrics for object-based attention. To address these limitations, we present ObjectVisA-120 -- a novel 120-participant dataset of spatial street-crossing navigation in virtual reality specifically geared to object-based attention evaluations. A key value of the dataset lies in the ethical and safety-related challenges that make collecting comparable data in real-world environments highly difficult. ObjectVisA-120 not only features accurate gaze data and a complete state-space representation of objects in the virtual environment, but it also offers variable scenario complexities and rich annotations, including panoptic segmentation, depth information, and vehicle keypoints. We further propose object-based similarity (oSIM) as a novel metric to evaluate the performance of object-based visual attention models, a previously unexplored performance characteristic. Our evaluations show that explicitly optimising for object-based attention not only improves oSIM performance but also leads to improved model performance on common metrics. In addition, we present SUMGraph, a Mamba U-Net-based model, which explicitly encodes critical scene objects (vehicles) in a graph representation, leading to further performance improvements over several state-of-the-art visual attention prediction methods. The dataset, code and models will be publicly released.
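The abstract does not give oSIM's exact formulation. As an illustration only, an object-based variant of the standard histogram-intersection similarity (SIM) used in saliency evaluation could restrict the comparison of predicted and ground-truth attention maps to annotated object masks and aggregate per-object scores. The function names, the per-object restriction, and the attention-mass weighting below are assumptions, not the paper's definition.

```python
import numpy as np

def sim(p, q):
    """Histogram intersection (SIM): sum of element-wise minima of two
    distributions, each normalised to sum to 1."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())

def osim(pred, gt, object_masks):
    """Illustrative object-based similarity (NOT the paper's oSIM):
    score SIM separately on each annotated object region, then average
    the per-object scores weighted by ground-truth attention mass.
    pred, gt: 2-D attention maps; object_masks: list of boolean masks."""
    scores, weights = [], []
    for mask in object_masks:
        gt_mass = gt[mask].sum()
        # Skip objects that received no attention in either map.
        if gt_mass == 0 or pred[mask].sum() == 0:
            continue
        scores.append(sim(pred[mask], gt[mask]))
        weights.append(gt_mass)
    if not weights:
        return 0.0
    return float(np.average(scores, weights=weights))
```

A perfect prediction yields a score of 1 on every attended object; attention spread off the annotated objects is simply ignored by this sketch, which is one plausible way a metric could isolate object-level agreement from background saliency.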
Problem

Research questions and friction points this paper is trying to address.

object-based visual attention
attention prediction
dataset
evaluation metrics
street-crossing environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

object-based visual attention
virtual reality dataset
oSIM metric
graph representation
Mamba U-Net
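SUMGraph's graph encoding of vehicles is described here only at a high level. A minimal sketch of the general idea, building a proximity graph over vehicle positions and applying one mean-aggregation message-passing step, might look as follows; the distance-threshold edge rule, the feature choice, and all names are assumptions rather than the paper's architecture.

```python
import numpy as np

def build_vehicle_graph(centroids, radius):
    """Illustrative scene-graph construction: connect vehicles whose
    centroids lie within `radius` of each other (assumed edge rule)."""
    n = len(centroids)
    # Pairwise Euclidean distances between all vehicle centroids.
    dists = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    # Adjacency: within radius, excluding self-loops.
    return (dists < radius) & ~np.eye(n, dtype=bool)

def gnn_layer(features, adj):
    """One mean-aggregation message-passing step: each node's new
    feature is the average over itself and its neighbours."""
    out = np.empty_like(features, dtype=float)
    for i in range(len(features)):
        nbrs = np.flatnonzero(adj[i])
        group = np.vstack([features[i], features[nbrs]]) if nbrs.size else features[i:i + 1]
        out[i] = group.mean(axis=0)
    return out
```

In a full model, node features would plausibly be derived from the dataset's vehicle keypoints and depth annotations, and the resulting node embeddings fused into the dense prediction backbone; this sketch only shows the graph-side mechanics.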
Igor Vozniak
Researcher / PhD candidate
AI
Philipp Mueller
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany; Max Planck Institute for Intelligent Systems, 70569 Stuttgart, Germany
Nils Lipp
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany
J. Sprenger
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany
Konstantin Poddubnyy
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany
Davit Hovhannisyan
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany
Christian Mueller
German Research Center for Artificial Intelligence (DFKI) GmbH, Campus D32, 66123 Saarbruecken, Germany
Andreas Bulling
Professor of Computer Science, University of Stuttgart
Human-Computer Interaction, Computer Vision, Machine Learning, Collaborative AI, Eye Tracking
Philipp Slusallek
Professor for Computer Graphics, Saarland University & DFKI, Saarland Informatics Campus
Visual Computing, Computer Graphics, Artificial Intelligence & Machine Learning, High-Performance Computing