🤖 AI Summary
Existing methods for personalized image generation uniformly model user historical sequences, ignoring semantic discrepancies between historical items and reference images, while over-relying on pixel- or feature-level consistency constraints; this leads to inaccurate preference modeling and insufficient personalization. To address this, we propose a retrieval-augmented, differential historical weighting mechanism that dynamically integrates semantically similar historical images. Instead of enforcing rigid consistency constraints, we introduce a learnable multimodal ranking task to jointly optimize user preference alignment and semantic fidelity. Our approach integrates multimodal retrieval, contrastive learning, cross-modal ranking, and diffusion model fine-tuning. Evaluated on three real-world datasets, our method achieves state-of-the-art performance, significantly outperforming five baselines on key metrics including User Preference Score and CLIP-Score.
📝 Abstract
Personalized image generation is crucial for improving the user experience, as it renders reference images into preferred ones according to users' visual preferences. Although effective, existing methods face two main issues. First, they treat all items in the user's historical sequence equally when extracting user preferences, overlooking the varying semantic similarities between historical items and the reference item. Disproportionately high weights for low-similarity items distort users' visual preferences for the reference item. Second, existing methods rely heavily on consistency between generated and reference images to optimize generation, which leads to underfitting user preferences and hinders personalization. To address these issues, we propose Retrieval Augment Personalized Image GenerAtion guided by Recommendation (RAGAR). Our approach uses a retrieval mechanism to assign different weights to historical items according to their similarities to the reference item, thereby extracting more refined visual preferences for the reference item. We then introduce a novel ranking task based on a multimodal ranking model to optimize the personalization of the generated images, instead of relying solely on consistency. Extensive experiments and human evaluations on three real-world datasets demonstrate that RAGAR achieves significant improvements in both personalization and semantic metrics compared to five baselines.
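The retrieval-weighted preference extraction described above can be illustrated with a minimal sketch: score each historical item's embedding by its cosine similarity to the reference item, convert the scores to weights, and pool the history accordingly so that low-similarity items contribute less. The embeddings, softmax weighting, and temperature below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_preference(history_emb, ref_emb, temperature=0.1):
    """Sketch of similarity-weighted preference extraction.

    history_emb: (n, d) array of historical item embeddings (e.g. CLIP features).
    ref_emb:     (d,) embedding of the reference item.
    Returns per-item weights and the pooled preference vector.
    """
    # Cosine similarity between each history item and the reference.
    h = history_emb / np.linalg.norm(history_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb)
    sims = h @ r
    # Softmax over similarities: low-similarity items get small weights.
    weights = softmax(sims / temperature)
    # Weighted pooling of the raw history embeddings.
    return weights, weights @ history_emb

# Toy example with random vectors standing in for learned embeddings.
rng = np.random.default_rng(0)
history = rng.normal(size=(5, 8))
reference = rng.normal(size=8)
w, pref = weighted_preference(history, reference)
```

The temperature controls how sharply the pooling concentrates on the most similar items; a uniform-weight baseline (the "treat all items equally" behavior the paper criticizes) corresponds to the limit of a very large temperature.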