🤖 AI Summary
Existing personalized generative models struggle to accurately capture user preferences and are typically confined to unimodal outputs, failing to meet the demands of real-world multimodal interaction scenarios. This work proposes DPPMG, a two-stage framework that first employs modality-specific graph neural networks to learn user preferences and quantize them into discrete tokens, which are then injected into both text and image generators. A cross-modal consistent personalized reward mechanism is further designed to enable reinforcement fine-tuning. DPPMG establishes the first discrete preference learning paradigm tailored for personalized multimodal generation, bridging the gap between continuous preference modeling and the discrete inputs required by generative models. Experimental results on two real-world datasets indicate that DPPMG improves both the personalization quality and the cross-modal consistency of generated content.
📝 Abstract
The emergence of generative models enables the creation of texts and images tailored to users' preferences. Existing personalized generative models have two critical limitations: they lack a dedicated paradigm for accurate preference modeling, and they generate unimodal content even though real-world user interactions are multimodal. Therefore, we propose personalized multimodal generation, which captures modal-specific preferences from multimodal interactions via a dedicated preference model and then feeds them into downstream generators to produce personalized multimodal content. However, this task presents two challenges: (1) a gap between the continuous preferences produced by dedicated modeling and the discrete token inputs intrinsic to generator architectures; (2) potential inconsistency between generated images and texts. To tackle these, we present a two-stage framework called Discrete Preference learning for Personalized Multimodal Generation (DPPMG). In the first stage, to accurately learn discrete modal-specific preferences, we introduce a modal-specific graph neural network (a dedicated preference model) to learn users' modal-specific preferences, which are then quantized into discrete preference tokens. In the second stage, the discrete modal-specific preference tokens are injected into downstream text and image generators. To further enhance cross-modal consistency while preserving personalization, we design a cross-modal consistent and personalized reward to fine-tune token-associated parameters. Extensive experiments on two real-world datasets demonstrate the effectiveness of our model in generating personalized and consistent multimodal content.
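The quantization step in the first stage can be illustrated with a minimal sketch. The abstract does not specify the quantization scheme, so this example assumes a plain nearest-neighbor codebook lookup (standard vector quantization); the codebook values, the function names, and the `<pref_k>` token format are all hypothetical and chosen only for illustration, not taken from DPPMG.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embedding: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to the embedding."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(embedding, codebook[i]),
    )


def to_preference_tokens(
    embeddings: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
    prefix: str = "<pref_",
) -> List[str]:
    """Map continuous preference vectors to discrete token strings.

    The token format is an assumption; a generator would need matching
    entries added to its vocabulary before such tokens could be injected.
    """
    return [f"{prefix}{quantize(e, codebook)}>" for e in embeddings]


# Toy codebook with three entries in a 2-D preference space (assumed values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Two continuous preference vectors, standing in for the output of a
# modal-specific preference model such as a graph neural network.
user_text_pref = [[0.9, 0.1], [0.1, 0.8]]

tokens = to_preference_tokens(user_text_pref, codebook)
print(tokens)
```

In this toy setting each continuous vector collapses to the identifier of its nearest codebook entry, which is the kind of discrete input a token-based text or image generator can consume. A learned codebook, and separate codebooks per modality, would be natural extensions, but those details are not given in the abstract.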