🤖 AI Summary
This study addresses the prediction of human visual scanpaths during search for a target object specified by a linguistic referring expression. The authors propose ScanVLA, a model that uses a vision-language model (VLM) to extract and fuse aligned visual and linguistic features from the input image and the referring expression. A History Enhanced Scanpath Decoder (HESD) takes the positions of previous fixations as input when predicting the current fixation. A frozen segmentation LoRA module serves as an auxiliary component that helps localize the referred object more precisely without large additional computational or time cost. Experimental results show that ScanVLA significantly outperforms existing scanpath prediction methods on this task.
📝 Abstract
Object Referring-guided Scanpath Prediction (ORSP) aims to predict the attention scanpath of a human searching a visual scene for a specific target object described by a linguistic expression. Multimodal information fusion is central to ORSP. We therefore propose a novel model, ScanVLA, which first uses a Vision-Language Model (VLM) to extract and fuse inherently aligned visual and linguistic feature representations from the input image and the referring expression. Next, to enhance ScanVLA's perception of fine-grained positional information, we propose a novel History Enhanced Scanpath Decoder (HESD) that directly takes the positions of historical fixations as input to help predict a more reasonable position for the current fixation. We also adopt a frozen Segmentation LoRA as an auxiliary component to localize the referred object more precisely, which improves scanpath prediction without incurring large additional computational or time costs. Extensive experimental results demonstrate that ScanVLA significantly outperforms existing scanpath prediction methods under object referring.
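To make the idea of feeding historical fixation positions into the decoder concrete, the following is a minimal sketch of an autoregressive scanpath loop. It is not the learned HESD: it swaps in a hand-written inhibition-of-return heuristic so that the example stays self-contained. The names `next_fixation`, `decode_scanpath`, `target_map`, and `ior_sigma` are hypothetical, and the relevance grid stands in for whatever fused vision-language features the real model would produce.

```python
import math
from typing import List, Tuple

Fixation = Tuple[float, float]  # (x, y), normalized to [0, 1]


def next_fixation(
    target_map: List[List[float]],
    history: List[Fixation],
    ior_sigma: float = 0.15,  # assumed width of the suppression around past fixations
) -> Fixation:
    """Pick the next fixation from a relevance grid, given past fixations.

    Each cell's relevance is down-weighted by its proximity to every
    previously fixated position (a toy inhibition-of-return rule).
    """
    rows, cols = len(target_map), len(target_map[0])
    best_score, best_pos = -math.inf, (0.5, 0.5)
    for r in range(rows):
        for c in range(cols):
            # Cell centre in normalized image coordinates.
            x, y = (c + 0.5) / cols, (r + 0.5) / rows
            penalty = 1.0
            for hx, hy in history:
                d2 = (x - hx) ** 2 + (y - hy) ** 2
                penalty *= 1.0 - math.exp(-d2 / (2.0 * ior_sigma ** 2))
            score = target_map[r][c] * penalty
            if score > best_score:
                best_score, best_pos = score, (x, y)
    return best_pos


def decode_scanpath(
    target_map: List[List[float]],
    start: Fixation = (0.5, 0.5),
    max_len: int = 4,
) -> List[Fixation]:
    """Autoregressively extend the scanpath; every step sees the full history."""
    history = [start]
    for _ in range(max_len):
        history.append(next_fixation(target_map, history))
    return history


if __name__ == "__main__":
    # Hypothetical 3x3 relevance grid; high values mark likely target cells.
    grid = [
        [0.1, 0.2, 0.9],
        [0.1, 0.5, 0.3],
        [0.8, 0.1, 0.1],
    ]
    print(decode_scanpath(grid, max_len=2))
```

On this grid, the path starts at the image centre, moves to the upper-right cell, and then to the lower-left cell, because already fixated positions are suppressed. In a learned decoder the heuristic penalty would be replaced by a network that consumes the encoded fixation history together with the fused features.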