PRE-MAP: Personalized Reinforced Eye-tracking Multimodal LLM for High-Resolution Multi-Attribute Point Prediction

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual saliency models neglect subjective cognitive diversity, hindering accurate modeling of individualized gaze behavior; low-resolution saliency maps compromise spatial precision, while multimodal large language models (MLLMs) suffer from hallucination and poor localization in multi-point prediction. To address these challenges in advertising video analysis, the authors propose PRE-MAP, an MLLM-based eye-tracking saliency model that conditions prediction on fine-grained multi-attribute user profiles (e.g., age and gender) to enable end-to-end, high-resolution gaze point sequence prediction, trained with Consistency Group Relative Policy Optimization (C-GRPO), a reinforcement learning policy that keeps outputs both format-compliant and spatially precise. The framework achieves cognitively grounded personalization with multi-point prediction. The authors also introduce SPA-ADV, a large-scale eye-tracking dataset specifically curated for advertising videos, and demonstrate substantial improvements over state-of-the-art methods across multiple benchmarks. All code and data are publicly released.

📝 Abstract
Visual selective attention, driven by individual preferences, regulates human prioritization of visual stimuli by bridging subjective cognitive mechanisms with objective visual elements, thereby steering the semantic interpretation and hierarchical processing of dynamic visual scenes. However, existing models and datasets predominantly neglect the influence of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models, typically employing segmentation approaches, rely on low-resolution imagery to generate saliency heatmaps that are subsequently upscaled to native resolution, which limits their capacity to capture personalized attention patterns. Furthermore, MLLMs are constrained by factors such as hallucination, making it costly for them to strictly adhere to the expected output format in multi-point prediction tasks and difficult to achieve precise point localization. To address these limitations, we present Subjective Personalized Attention for Advertisement Videos, namely SPA-ADV, a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants of varying age and gender across 486 videos. Furthermore, we propose PRE-MAP, a novel eye-tracking saliency model that characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, built upon MLLMs and guided by Multi-Attribute user profiles to predict Points. To ensure MLLMs produce prediction points that are both format-correct and spatially accurate, we introduce Consistency Group Relative Policy Optimization (C-GRPO), inspired by the variability in eye movement points and Multi-Attribute profiles. Extensive experiments on SPA-ADV and other benchmarks demonstrate the effectiveness of our approach. The code and dataset are available at https://github.com/mininglamp-MLLM/PRE-MAP.
Problem

Research questions and friction points this paper is trying to address.

Captures personalized attention patterns in visual stimuli
Improves saliency prediction with high-resolution multi-attribute data
Ensures accurate point prediction in multimodal LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning-optimized eye-tracking for personalization
Multi-Attribute user profiles guiding point prediction
Consistency Group Relative Policy Optimization for accuracy
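The core C-GRPO idea, as described in the abstract, combines GRPO-style group-relative advantages (no learned value critic) with rewards for both format compliance and spatial accuracy of predicted gaze points. A minimal sketch of that recipe follows; the reward shapes, function names, and weighting here are illustrative assumptions, not the paper's actual implementation:

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled completion's reward
    by the mean and standard deviation of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def point_reward(pred_points, gt_points, width, height):
    """Toy composite reward (assumed, not from the paper):
    a format term that checks the model emitted the expected number of
    in-frame (x, y) pairs, plus a spatial term that decays with the mean
    diagonal-normalized distance to the ground-truth gaze points."""
    format_ok = (
        len(pred_points) == len(gt_points)
        and all(0 <= x <= width and 0 <= y <= height for x, y in pred_points)
    )
    if not format_ok:
        return 0.0  # malformed output earns no spatial credit
    diag = math.hypot(width, height)
    mean_dist = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred_points, gt_points)
    ) / (len(gt_points) * diag)
    return 1.0 + max(0.0, 1.0 - mean_dist)
```

With this sketch, a group of sampled point sequences is scored with `point_reward`, and `group_relative_advantages` turns those scores into per-sample advantages for the policy-gradient update, so sequences that are well-formatted and close to the observed gaze earn positive advantage within their group.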
Hanbing Wu
Jilin University
Ping Jiang
Peking University
Anyang Su
Mininglamp Technology
Chenxu Zhao
Mininglamp Technology
Tianyu Fu
Ph.D. at Tsinghua University (efficient AI, LLMs, sparse computation)
Minghui Wu
Zhejiang University City College (Mobile Computing, Big Data, Machine Learning, Software Engineering)
Beiping Tan
Mininglamp Technology
Huiying Li
Jilin University