What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP Metrics

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation in existing eye-movement scanpath similarity metrics, which predominantly rely on spatial and temporal alignment while neglecting semantic equivalence between fixated regions. To overcome this, the study uses vision-language models (VLMs) to generate context-aware textual descriptions for individual fixations—via patch-based or marker-based encoding strategies—and aggregates these into a scanpath-level semantic representation. Semantic similarity is then computed using embedding- and lexicon-based natural language processing metrics. The resulting approach yields an interpretable, content-aware similarity measure that complements traditional geometric alignment methods. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures information partially orthogonal to spatial alignment, revealing that scanpaths with divergent spatial distributions can nonetheless exhibit high semantic consistency.
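The core pipeline described above—aggregate per-fixation descriptions into a scanpath-level text, then score similarity in language space—can be sketched minimally. This is an illustrative stand-in, not the paper's implementation: the toy `embed` below uses bag-of-words counts, whereas the paper uses learned embedding- and lexicon-based NLP metrics; the example descriptions are invented.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; the paper uses learned
    # embedding-based and lexicon-based NLP metrics instead.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_scanpath_similarity(descriptions_a, descriptions_b):
    # Aggregate per-fixation VLM descriptions into one
    # scanpath-level representation per observer, then compare.
    doc_a = " ".join(descriptions_a)
    doc_b = " ".join(descriptions_b)
    return cosine(embed(doc_a), embed(doc_b))

# Hypothetical per-fixation descriptions from two observers whose
# fixations land in different places but on similar content.
obs_a = ["a red car parked", "a pedestrian crossing", "a traffic light"]
obs_b = ["traffic light turning green", "a parked red car"]
print(round(semantic_scanpath_similarity(obs_a, obs_b), 3))  # → 0.707
```

Even this crude lexical overlap illustrates the paper's key observation: two scanpaths can score high semantic similarity despite fixating spatially divergent regions.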
📝 Abstract
Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
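The abstract names DTW as one of the established spatial baselines against which the semantic metric is compared. For contrast with the content-based view, a minimal pure-Python dynamic time warping distance over fixation coordinates (assuming scanpaths as sequences of (x, y) pixel positions; the coordinates below are invented):

```python
import math

def dtw(path_a, path_b):
    # Dynamic time warping distance between two scanpaths, each a
    # sequence of (x, y) fixation coordinates. Pure spatial alignment:
    # it is blind to what content the fixations actually land on.
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(path_a[i - 1], path_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

a = [(100, 100), (200, 150), (300, 300)]
b = [(110, 95), (205, 160), (290, 310)]
print(dtw(a, b))
```

Two scanpaths far apart under this geometric distance may still receive a high semantic score, which is exactly the partially independent variance the paper reports.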
Problem

Research questions and friction points this paper is trying to address.

scanpath similarity
eye-tracking
semantic equivalence
vision-language models
NLP metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic scanpath similarity
vision-language models
eye-tracking analysis
NLP metrics
multimodal foundation models