Through Their Eyes: Fixation-aligned Tuning for Personalized User Emulation

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing large language model–based user simulators struggle to faithfully replicate users’ visual attention behaviors in recommendation interfaces, resulting in insufficient simulation fidelity. To address this limitation, this work introduces FixATE, the first approach that aligns personalized user eye-tracking patterns with the internal attention mechanisms of vision-language models (VLMs) through personalized soft prompts, thereby enabling a “see-as-user” simulation paradigm. By integrating interpretable probing operators with real eye-tracking data, FixATE significantly enhances both attention alignment and click prediction accuracy across multiple VLM architectures, demonstrating its effectiveness and generalizability.

📝 Abstract

Large language model (LLM) agents are increasingly deployed as scalable user simulators for recommender system evaluation. Yet existing simulators perceive recommendations through text or structured metadata rather than the visual interfaces real users browse. This is a critical gap, since attention over recommendation layouts is both visually driven and highly personalized. We investigate whether aligning a vision-language model's (VLM's) visual attention with user-specific gaze patterns can improve simulation fidelity. Analysis of a real-world eye-tracking dataset collected in a carousel-based recommendation setting reveals that users exhibit stable individual gaze patterns that are strongly predictive of click behavior. Building on this finding, we propose Fixation-Aligned Tuning for user Emulation (FixATE). Our approach first probes the VLM's internal visual attention via interpretability operators to obtain a slot-level relevance distribution comparable with human fixation, and then learns personalized soft prompts to steer the model's attention toward each user's characteristic fixation pattern. Experiments across three interpretability-based probing operators and two architecturally distinct VLM backbones demonstrate consistent improvements in both attention alignment and click prediction accuracy. These results suggest that making the model "see like the user" is a viable path toward simulators that more faithfully reproduce how users perceive and act in recommendation interfaces.
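
To make the idea of a "slot-level relevance distribution comparable with human fixation" more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a probing operator has already produced one relevance score per image patch, that each patch is mapped to a carousel slot (or to -1 if it lies outside every slot), and that the mismatch with a user's fixation distribution is measured with KL divergence. The abstract does not state the alignment objective, so the KL choice and the names `slot_distribution` and `kl_divergence` are illustrative assumptions.

```python
import math
from typing import List, Sequence


def slot_distribution(
    patch_scores: Sequence[float],
    patch_to_slot: Sequence[int],
    num_slots: int,
) -> List[float]:
    """Aggregate per-patch relevance into a probability distribution over slots.

    Hypothetical helper: negative scores are clipped to zero, and patches
    mapped to -1 (outside every slot) are ignored.
    """
    totals = [0.0] * num_slots
    for score, slot in zip(patch_scores, patch_to_slot):
        if slot >= 0:
            totals[slot] += max(score, 0.0)
    mass = sum(totals)
    if mass == 0.0:
        # No relevance landed on any slot: fall back to a uniform distribution.
        return [1.0 / num_slots] * num_slots
    return [t / mass for t in totals]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-8) -> float:
    """KL(p || q), with a small epsilon to avoid division by zero.

    Here p is the human fixation distribution and q the model's slot
    distribution. Using KL as the alignment measure is an assumption.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Toy example: four patches, two slots, last patch outside every slot.
model_slots = slot_distribution([1.0, 1.0, 2.0, 4.0], [0, 0, 1, -1], num_slots=2)
human_fixation = [0.9, 0.1]  # made-up normalized fixation time per slot
mismatch = kl_divergence(human_fixation, model_slots)
```

In a setup like the one the abstract describes, a quantity such as `mismatch` would presumably serve as the training signal for the personalized soft prompts while the VLM backbone stays fixed; that training loop is omitted here because the abstract gives no details about it.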

Problem

Research questions and friction points this paper is trying to address.

user simulation

visual attention

recommendation systems

gaze patterns

personalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fixation-Aligned Tuning

Vision-Language Model

Personalized User Emulation