Contextualized Visual Personalization in Vision-Language Models

πŸ“… 2026-02-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the limitation that existing vision-language models cannot leverage a user's accumulated visual-textual experiences to interpret personalized context in new images. The authors formally introduce the task of contextualized visual personalization for the first time and propose CoViP, a unified framework built around personalized image captioning as the core task. By combining reinforcement-learning-based post-training with a caption-augmented generation mechanism, CoViP strengthens the model's ability to recognize and exploit user-specific visual context. The study also designs a diagnostic evaluation protocol that explicitly rules out textual shortcuts, ensuring that personalization is genuinely grounded in visual context. Experimental results show significant gains on personalized image captioning and multiple downstream tasks, consistently outperforming both open-source and proprietary state-of-the-art models.

πŸ“ Abstract
Despite recent progress in vision-language models (VLMs), existing approaches often fail to generate personalized responses based on the user's specific experiences, as they lack the ability to associate visual inputs with a user's accumulated visual-textual context. We formalize this challenge as contextualized visual personalization, which requires VLMs to visually recognize and textually retrieve personalized visual experiences when interpreting new images. To address this issue, we propose CoViP, a unified framework that treats personalized image captioning as a core task for contextualized visual personalization and improves this capability through reinforcement-learning-based post-training and caption-augmented generation. We further introduce diagnostic evaluations that explicitly rule out textual shortcut solutions and verify whether VLMs truly leverage visual context. Extensive experiments demonstrate that existing open-source and proprietary VLMs exhibit substantial limitations, while CoViP not only improves personalized image captioning but also yields holistic gains across downstream personalization tasks. These results highlight CoViP as a crucial stage for enabling robust and generalizable contextualized visual personalization.
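The caption-augmented generation idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: all function names, the memory format, and the toy embeddings are hypothetical stand-ins for a real vision encoder and the paper's retrieval details. The sketch retrieves the captions of the user's most similar past images and prepends them to the prompt given to the VLM.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_context(query_emb, memory, k=2):
    """Return captions of the k stored experiences most similar to the query image."""
    ranked = sorted(memory, key=lambda e: cosine(query_emb, e["embedding"]), reverse=True)
    return [e["caption"] for e in ranked[:k]]

def build_prompt(query_emb, memory, instruction):
    """Prepend retrieved personal captions to the instruction sent to the VLM."""
    captions = retrieve_context(query_emb, memory)
    context = "\n".join(f"- {c}" for c in captions)
    return f"User's past visual experiences:\n{context}\n\n{instruction}"

# Toy user memory: short vectors stand in for image features from a vision encoder.
memory = [
    {"embedding": [1.0, 0.1, 0.0], "caption": "My dog Max at the beach"},
    {"embedding": [0.0, 1.0, 0.2], "caption": "Grandma's kitchen during the holidays"},
    {"embedding": [0.9, 0.2, 0.1], "caption": "Max chasing a ball in the park"},
]

prompt = build_prompt([1.0, 0.0, 0.0], memory, "Describe this new image for the user.")
print(prompt)
```

Run on the toy memory above, a query embedding close to the two "Max" images pulls both of those captions into the prompt while leaving the unrelated kitchen scene out, so a downstream VLM could refer to the user's dog by name.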
Problem

Research questions and friction points this paper is trying to address.

contextualized visual personalization
vision-language models
personalized image captioning
visual-textual context
personalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

contextualized visual personalization
personalized image captioning
vision-language models
reinforcement learning post-training
visual context grounding
πŸ”Ž Similar Papers
No similar papers found.