AI Summary
Prior to this work, tokenization strategies for gaze data, critical for integrating eye-tracking signals into large language models (LLMs) and multimodal LLMs (MLLMs), remained unexplored, leaving a fundamental gap in modality-specific preprocessing. Method: We systematically evaluate five tokenization approaches (quantile binning, k-means clustering, and linear, logarithmic, and adaptive binning), accounting for gaze sequences' continuity and statistical heterogeneity. Leveraging pre-trained MLLMs' visual encoders, we discretize gaze trajectories into learnable tokens compatible with LLM architectures. Contribution/Results: Quantile binning achieves the lowest error in spatial position prediction, while k-means excels in velocity prediction, demonstrating a strong dependence of tokenization efficacy on gaze distribution characteristics. Experiments across three benchmark eye-tracking datasets show consistent improvements in reconstruction accuracy, compression ratio, and downstream LLM performance. This work establishes the first principled framework for gaze tokenization, enabling effective integration of biological signals into multimodal foundation models.
Abstract
A considerable part of the performance of today's large language models (LLMs) and multimodal large language models (MLLMs) depends on their tokenization strategies. While tokenizers are extensively researched for textual and visual input, there is no research on tokenization strategies for gaze data, owing to its continuous, signal-like nature. However, a corresponding tokenization strategy would allow using the vision capabilities of pre-trained MLLMs for gaze data, for example, through fine-tuning. In this paper, we aim to close this research gap by analyzing five different tokenizers for gaze data on three different datasets for the forecasting and generation of gaze data through LLMs (cf.~\cref{fig:teaser}). We evaluate the tokenizers regarding their reconstruction and compression abilities. Further, we train an LLM for each tokenization strategy and measure its generative and predictive performance. Overall, we find that a quantile tokenizer outperforms all others in predicting gaze positions, while k-means is best when predicting gaze velocities.
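As a minimal illustration of two of the tokenizers compared above, the sketch below discretizes a one-dimensional stream of gaze coordinates into token indices using equal-frequency (quantile) bins and, alternatively, 1-D k-means centroids. The function names, bin counts, and synthetic data are illustrative assumptions and not the paper's implementation; the real tokenizers operate on full 2-D trajectories.

```python
# Hypothetical sketch of quantile-bin and k-means gaze tokenization.
# Assumed names and parameters; not the paper's code.
import numpy as np

def quantile_tokenize(values, n_bins=16):
    """Map each value to its equal-frequency (quantile) bin index in [0, n_bins)."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, edges)

def kmeans_tokenize(values, n_bins=16, n_iter=20, seed=0):
    """Map each value to its nearest 1-D k-means centroid (plain Lloyd iterations)."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=n_bins, replace=False).astype(float)
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_bins):
            if np.any(labels == k):          # keep empty clusters unchanged
                centroids[k] = values[labels == k].mean()
    return np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)

# Synthetic horizontal gaze coordinates, normalized to [0, 1].
gaze_x = np.random.default_rng(1).normal(0.5, 0.15, size=10_000).clip(0, 1)
tokens_q = quantile_tokenize(gaze_x)   # near-uniform token usage
tokens_k = kmeans_tokenize(gaze_x)     # density-adaptive centroids
```

The two schemes differ in how they spend the token budget: quantile bins guarantee roughly equal token frequencies regardless of the gaze distribution, whereas k-means places more centroids where samples concentrate, which is one plausible reason their relative strengths differ between position and velocity signals.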