Personalized Image Descriptions from Attention Sequences

📅 2025-12-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing personalized image captioning methods model only linguistic style, neglecting how individual visual perception patterns (attention regions, object preferences, and viewing order) shape caption generation. This work proposes DEPER, the first model to explicitly take personalized visual attention sequences as core input, jointly modeling perceptual behavior and linguistic expression. Methodologically, DEPER learns user-specific embeddings through an auxiliary attention-prediction task and adapts a frozen large vision-language model via lightweight adapters, enabling few-shot personalization without retraining the backbone. Evaluated on four benchmark datasets, DEPER achieves an average 24% improvement on standard metrics (e.g., CIDEr), markedly improving caption naturalness, diversity, and human alignment. These results empirically validate the critical role of perceptual personalization in multimodal generative tasks.

📝 Abstract
People can view the same image differently: they focus on different regions, objects, and details in varying orders and describe them in distinct linguistic styles. This leads to substantial variability in image descriptions. However, existing models for personalized image description focus on linguistic style alone, with no prior work leveraging individual viewing patterns. We address this gap by explicitly modeling personalized viewing behavior as a core factor in description generation. Our method, DEPER (DEscription-PERception persona encoder), learns a subject embedding that captures both linguistic style and viewing behavior, guided by an auxiliary attention-prediction task. A lightweight adapter aligns these embeddings with a frozen vision-language model, enabling few-shot personalization without retraining. Across four datasets spanning diverse viewing tasks and both short and detailed descriptions, DEPER achieves a 24% average improvement, showing that modeling personalized attention produces more human-aligned and high-quality descriptions. We posit that understanding how people see helps predict what they say; modeling human diversity in perception can improve both performance and human alignment in multimodal systems.
Problem

Research questions and friction points this paper is trying to address.

Models personalized viewing behavior for image descriptions
Integrates linguistic style and attention patterns in descriptions
Enables few-shot personalization without retraining vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns personalized linguistic style and viewing behavior embeddings
Uses lightweight adapter for few-shot personalization without retraining
Models personalized attention to improve description quality and human alignment
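The persona-embedding-plus-adapter design described above can be illustrated with a minimal numpy sketch. Everything here is an illustrative assumption (the dimensions, weight names, and the softmax attention head are invented for exposition), not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper).
D_IMG, D_PERSONA, D_VLM = 16, 8, 32

# Frozen "vision-language model" projection: stands in for the real VLM
# backbone, whose weights are never updated during personalization.
W_vlm = rng.normal(size=(D_IMG, D_VLM))

# Per-user persona embedding. In DEPER this is learned jointly from the
# user's descriptions and attention sequences; here it is just initialized.
persona = rng.normal(size=D_PERSONA)

# Lightweight adapter: the only trainable mapping, projecting the persona
# embedding into the frozen model's feature space.
W_adapter = rng.normal(size=(D_PERSONA, D_VLM)) * 0.01

# Hypothetical auxiliary attention head: the persona embedding is also asked
# to score image regions, supervising it to encode viewing behavior.
W_region = rng.normal(size=(D_PERSONA, D_IMG))

def personalized_features(image_feats: np.ndarray) -> np.ndarray:
    """Frozen VLM features plus an adapter-injected persona offset."""
    return image_feats @ W_vlm + persona @ W_adapter

def predict_attention(persona_vec: np.ndarray, region_feats: np.ndarray) -> np.ndarray:
    """Softmax distribution over regions, conditioned on the persona."""
    logits = region_feats @ (persona_vec @ W_region)  # one score per region
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

image = rng.normal(size=D_IMG)
feats = personalized_features(image)          # personalized features for captioning
regions = rng.normal(size=(5, D_IMG))
attn = predict_attention(persona, regions)    # predicted fixation distribution
```

The design point mirrored here is that only the persona embedding and adapter would be trained per user, so few-shot personalization never touches the shared frozen backbone.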