🤖 AI Summary
This study addresses the challenge of distinguishing target from non-target fixations during free visual search in realistic visual scenes. We propose a multimodal intent recognition method that integrates eye tracking and electroencephalography (EEG), extracting features including fixation duration, saccade-related potentials (SRPs), and neural oscillatory activity to build a cross-subject classification model. To our knowledge, this is the first work to classify fixation intent during naturalistic free visual search, specifically in ecologically valid tasks such as searching for icons on desktop backgrounds and finding tools in a cluttered workshop. It thereby addresses a limitation of prior studies, which relied on abstract stimuli and explicitly instructed gaze trajectories. The proposed model achieves a cross-subject classification accuracy of 83.6%, substantially outperforming an SRP-based baseline (56.9%) and indicating that the approach generalizes across users in realistic scenes.
📝 Abstract
Distinguishing target from non-target fixations during visual search is a fundamental building block for understanding users' intended actions and for building effective assistance systems. While prior research indicated the feasibility of classifying target vs. non-target fixations based on eye tracking and electroencephalography (EEG) data, these studies used explicitly instructed search trajectories and abstract visual stimuli, and they disregarded scene context. This is in stark contrast to the fact that human visual search is largely driven by scene characteristics, and it raises questions about generalizability to more realistic scenarios. To close this gap, we investigate, for the first time, the classification of target vs. non-target fixations during free visual search in realistic scenes. In particular, we conducted a 36-participant user study using 140 varied, realistic visual search scenes in two highly relevant application scenarios: searching for icons on desktop backgrounds and finding tools in a cluttered workshop. Our approach, based on gaze and EEG features, outperforms the previous state-of-the-art approach, which combines fixation duration and saccade-related potentials. We perform extensive evaluations to assess the generalizability of our approach across scene types. Our approach significantly advances the ability to distinguish between target and non-target fixations in realistic scenarios, achieving 83.6% accuracy in cross-user evaluations. This substantially outperforms previous methods based on saccade-related potentials, which reached only 56.9% accuracy.
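To make the notion of a cross-user evaluation concrete, the following is a minimal sketch of a leave-one-participant-out protocol, in which every fixation of the held-out participant is excluded from training. It is not the authors' pipeline: the field names (`participant`, `duration_ms`, `is_target`), the function names, and the toy threshold classifier on fixation duration are all assumptions made for illustration, and the assumption that target fixations tend to be longer is only used to keep the example small.

```python
from collections import defaultdict
from statistics import mean


def leave_one_participant_out(samples):
    """Yield (held_out_id, train, test), holding out each participant once.

    `samples` is a list of dicts with the hypothetical keys
    "participant", "duration_ms" and "is_target".
    """
    by_participant = defaultdict(list)
    for sample in samples:
        by_participant[sample["participant"]].append(sample)

    for held_out in sorted(by_participant):
        # Training data contains no fixation from the held-out participant.
        train = [
            sample
            for pid, group in by_participant.items()
            if pid != held_out
            for sample in group
        ]
        yield held_out, train, by_participant[held_out]


def fit_threshold(train):
    """Toy classifier (assumption): midpoint between the mean fixation
    durations of target and non-target fixations in the training data."""
    target = [s["duration_ms"] for s in train if s["is_target"]]
    non_target = [s["duration_ms"] for s in train if not s["is_target"]]
    return (mean(target) + mean(non_target)) / 2


def cross_user_accuracy(samples):
    """Return a dict mapping each held-out participant to its accuracy."""
    per_user = {}
    for held_out, train, test in leave_one_participant_out(samples):
        threshold = fit_threshold(train)
        # Predict "target" when the fixation is longer than the threshold.
        correct = sum(
            (s["duration_ms"] > threshold) == s["is_target"] for s in test
        )
        per_user[held_out] = correct / len(test)
    return per_user
```

Averaging the values returned by `cross_user_accuracy` gives a single cross-user figure. Splitting by participant rather than by fixation is the design choice that matters here, since mixing one person's fixations across training and test data would let a model benefit from user-specific patterns.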