🤖 AI Summary
This work addresses the prediction of eye-movement scanpaths during target-present visual search. It evaluates SemBA-FAST, a top-down framework that combines deep object detection, probabilistic semantic fusion, and biologically inspired foveal modeling. The framework generates a top-down semantic attention map and refines it fixation by fixation through artificial foveation, without relying on human scanpath data as a prior. On COCO-Search18, its predicted fixation sequences closely match human ground-truth scanpaths: it outperforms the baseline and other top-down approaches and, in some cases, competes with models that are informed by human scanpaths.
📝 Abstract
In goal-directed visual tasks, human perception is guided by both top-down and bottom-up cues, and foveal vision plays a crucial role in directing attention efficiently. Recent bio-inspired computational attention models have exploited advances in deep learning, using human scanpath data to reach new state-of-the-art performance. In this work, we assess the performance of SemBA-FAST (Semantic-based Bayesian Attention for Foveal Active visual Search Tasks), a top-down framework for predicting human visual attention in target-present visual search. SemBA-FAST integrates deep object detection with a probabilistic semantic fusion mechanism to generate attention maps dynamically: it uses pre-trained detectors and artificial foveation to update its top-down knowledge after each fixation and so improve fixation prediction sequentially. We evaluate SemBA-FAST on the COCO-Search18 benchmark dataset and compare it against other scanpath prediction models. Our method produces fixation sequences that closely match human ground-truth scanpaths. Notably, it surpasses the baseline and other top-down approaches and, in some cases, competes with scanpath-informed models. These findings provide insight into the capabilities of semantic-foveal probabilistic frameworks for human-like attention modelling, with implications for real-time cognitive computing and robotics.
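The sequential loop described above (fixate, fuse detector evidence that is degraded by foveation, pick the next fixation) can be illustrated with a toy sketch. This is not the SemBA-FAST implementation. The grid world, the Gaussian `acuity` falloff, the simulated `detector_score`, the tempered binary likelihood in `bayes_update`, and every parameter value are assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random

def acuity(cell, fixation, sigma=2.0):
    """Hypothetical foveation: detector reliability decays with distance from the fixated cell."""
    d2 = (cell[0] - fixation[0]) ** 2 + (cell[1] - fixation[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))

def detector_score(is_target, reliability, rng):
    """Stand-in for a pre-trained detector: informative at the fovea, noise in the periphery."""
    return reliability * float(is_target) + (1.0 - reliability) * rng.random()

def bayes_update(prior, score, reliability):
    """Binary Bayes rule for P(target in cell); the likelihood is pulled towards 0.5 when reliability is low."""
    like_target = min(max(0.5 + reliability * (score - 0.5), 0.01), 0.99)
    like_background = 1.0 - like_target
    evidence = prior * like_target + (1.0 - prior) * like_background
    return prior * like_target / evidence

def search(size, target, start, max_fixations=10, seed=0):
    """Fixate, update every cell's belief, then move to the most probable cell."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(size) for c in range(size)]
    belief = {cell: 0.5 for cell in cells}  # uninformative prior (assumption)
    fixation, scanpath = start, [start]
    for _ in range(max_fixations):
        for cell in cells:
            rel = acuity(cell, fixation)
            score = detector_score(cell == target, rel, rng)
            belief[cell] = bayes_update(belief[cell], score, rel)
        if fixation == target:
            break  # target-present search ends once the target is foveated
        fixation = max(cells, key=belief.get)  # greedy choice of the next fixation
        scanpath.append(fixation)
    return scanpath, belief

if __name__ == "__main__":
    path, _ = search(size=8, target=(6, 7), start=(0, 0))
    print(path)
```

In this sketch, cells far from the current fixation contribute almost no evidence, so their belief stays near the prior, while the fixated cell is resolved almost completely. A real system would replace `detector_score` with the class scores of an actual detector run on a foveated image and would likely use a richer, multi-class semantic fusion than the binary rule shown here.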