🤖 AI Summary
This work exposes a critical security vulnerability in vision-language model (VLM)-based gaze prediction systems, which currently lack robust defenses against backdoor attacks. We propose the first variable-output backdoor attack tailored to VLM-based gaze prediction, introducing an input-aware dual mechanism that jointly manipulates spatial fixation locations and dwell durations. This approach lets the model maintain normal performance on clean inputs while reliably generating attacker-specified yet realistic gaze trajectories when triggered. We implement data poisoning with visual, textual, and multimodal triggers on GazeFormer trained on the COCO-Search18 dataset, demonstrating the attack's effectiveness across varying poisoning ratios, trigger modalities, and real-world deployment scenarios, including both legacy and modern smartphones. Experimental results show that five state-of-the-art post-training defenses fail to simultaneously preserve clean accuracy and suppress the backdoor, underscoring the practical threat and the significant defensive challenges this attack poses.
📝 Abstract
Scanpath prediction models forecast the sequence and timing of human fixations during visual search, driving foveated rendering and attention-based interaction in mobile systems where their integrity is a first-class security concern. We present the first study of backdoor attacks against VLM-based scanpath prediction, evaluated on GazeFormer and COCO-Search18. We show that naive fixed-path attacks, while effective, create detectable clustering in the continuous output space. To overcome this, we design two variable-output attacks: an input-aware spatial attack that redirects predicted fixations toward an attacker-chosen target object, and a scanpath duration attack that inflates fixation durations to delay visual search completion. Both attacks condition their output on the input scene, producing diverse and plausible scanpaths that evade cluster-based detection. We evaluate across three trigger modalities (visual, textual, and multimodal), multiple poisoning ratios, and five post-training defenses, finding that no defense simultaneously suppresses the attacks and preserves clean performance across all configurations. We further demonstrate that backdoor behavior survives quantization and deployment on both flagship and legacy commodity smartphones, confirming practical threat viability for edge-deployed gaze-driven systems.
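To make the poisoning setup concrete, the sketch below illustrates how a single training sample might be poisoned under the described threat model: a small visual trigger is stamped into the image, and the ground-truth scanpath is rewritten to pull fixations toward an attacker-chosen target while inflating dwell durations. This is a minimal illustration, not the paper's implementation; the patch placement, `pull` interpolation, and duration multiplier are assumptions introduced here, and the actual attack conditions on the input scene in a learned, model-specific way.

```python
import numpy as np

def add_visual_trigger(image, patch_size=8, value=255):
    """Stamp a small solid patch in the bottom-right corner.
    (Assumed trigger design; the paper also uses textual and
    multimodal triggers.)"""
    img = image.copy()
    img[-patch_size:, -patch_size:] = value
    return img

def poison_scanpath(scanpath, target_xy, pull=0.8, dwell_scale=1.5):
    """Rewrite a scanpath label for a poisoned sample.

    Each fixation (x, y, duration_ms) is interpolated toward the
    attacker-chosen target object and its dwell time is inflated.
    Because the output depends on the clean input path, poisoned
    labels stay diverse instead of collapsing to one fixed path
    (the clustering signature that fixed-path attacks exhibit).
    """
    tx, ty = target_xy
    poisoned = []
    for x, y, dur in scanpath:
        nx = (1.0 - pull) * x + pull * tx
        ny = (1.0 - pull) * y + pull * ty
        poisoned.append((nx, ny, dur * dwell_scale))
    return poisoned

# Example: poison one sample.
image = np.zeros((64, 64), dtype=np.uint8)
clean_path = [(10.0, 12.0, 180.0), (30.0, 40.0, 220.0)]
triggered = add_visual_trigger(image)
poisoned_path = poison_scanpath(clean_path, target_xy=(55.0, 55.0))
```

An input-aware variant of this idea is what lets the attack evade cluster-based detection: every poisoned output is a plausible, scene-dependent trajectory rather than a single repeated path.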