Personalized Image Generation with Large Multimodal Models

📅 2024-10-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing key challenges in personalized image generation—including difficulty in modeling visual preferences, scarcity of training data, and weak multimodal instruction understanding—this paper introduces Pigeon, a three-module large vision-language model integrating noise-robust representation learning, multimodal instruction comprehension, and unsupervised preference alignment. Its core innovation is a two-stage preference alignment mechanism—masked preference reconstruction and pairwise preference alignment—that precisely aligns user interaction history with generation objectives without requiring human annotations. Evaluated on personalized sticker and movie poster generation tasks, Pigeon significantly outperforms state-of-the-art baselines. Quantitative metrics (FID, CLIPScore) and human evaluations jointly demonstrate superior performance in generation accuracy, diversity, and user preference consistency.
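The pairwise preference alignment stage described above can be pictured as a DPO-style objective over generated images: the model is pushed to assign higher likelihood to the image style the user actually interacted with than to a sampled alternative, relative to a frozen reference model. The sketch below is illustrative only, not the authors' implementation; the function name, the β temperature, and the use of summed log-likelihoods are all assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             beta=0.1):
    """DPO-style pairwise loss (hypothetical sketch).

    logp_* are per-example log-likelihoods of the preferred and
    dispreferred images under the policy; ref_logp_* are the same
    quantities under a frozen reference model. The loss rewards a
    larger likelihood margin for the preferred image than the
    reference model already has, without any human annotation.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

With identical policy and reference likelihoods the loss sits at log 2, and it falls as the policy widens its margin in favor of the preferred image.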

📝 Abstract
Personalized content filtering, such as recommender systems, has become a critical infrastructure to alleviate information overload. However, these systems merely filter existing content and are constrained by its limited diversity, making it difficult to meet users' varied content needs. To address this limitation, personalized content generation has emerged as a promising direction with broad applications. Nevertheless, most existing research focuses on personalized text generation, with relatively little attention given to personalized image generation. The limited work in personalized image generation faces challenges in accurately capturing users' visual preferences and needs from noisy user-interacted images and complex multimodal instructions. Worse still, there is a lack of supervised data for training personalized image generation models. To overcome the challenges, we propose a Personalized Image Generation Framework named Pigeon, which adopts exceptional large multimodal models with three dedicated modules to capture users' visual preferences and needs from noisy user history and multimodal instructions. To alleviate the data scarcity, we introduce a two-stage preference alignment scheme, comprising masked preference reconstruction and pairwise preference alignment, to align Pigeon with the personalized image generation task. We apply Pigeon to personalized sticker and movie poster generation, where extensive quantitative results and human evaluation highlight its superiority over various generative baselines.
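The masked preference reconstruction stage mentioned in the abstract addresses the lack of supervised data: part of a user's interaction history is hidden, and the model is trained to reconstruct the held-out interacted image from what remains, turning raw behavior logs into a self-supervised signal. The helper below is a minimal sketch of such masking under assumed token-sequence inputs; the function name, mask-token convention, and mask ratio are illustrative, not taken from the paper.

```python
import torch

def mask_history(history_tokens, mask_token_id, mask_ratio=0.3,
                 generator=None):
    """Randomly replace a fraction of user-history tokens with a mask
    token (hypothetical sketch). The model would then be trained to
    reconstruct the user-interacted image from the unmasked context,
    so no human preference labels are required."""
    mask = torch.rand(history_tokens.shape, generator=generator) < mask_ratio
    masked = history_tokens.clone()
    masked[mask] = mask_token_id
    return masked, mask
```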
Problem

Research questions and friction points this paper is trying to address.

Personalized Image Generation
Complex Instruction Understanding
Insufficient Training Data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pigeon System
Personalized Image Generation
Two-Stage Method
Yiyan Xu
University of Science and Technology of China
Personalized Generation · Generative AI · Generative Recommendation
Wenjie Wang
National University of Singapore, Singapore
Yang Zhang
National University of Singapore, Singapore
Biao Tang
Meituan, China
Peng Yan
Research Assistant of ZHAW, PhD student of UZH
Deep Learning · Transfer Learning · Intelligent Algorithm
Fuli Feng
University of Science and Technology of China, Hefei, China
Xiangnan He
University of Science and Technology of China
Recommendation · Causality · Big Data · Information Retrieval · Machine Learning