🤖 AI Summary
Existing personalized image generation methods struggle to accurately model users’ subtle aesthetic preferences and lack effective mechanisms for disentangling preference features and aligning them across modalities. To address this, we propose a unified generation framework that jointly conditions on user preferences and textual prompts. Our method introduces a dual-discriminative probing module that separates inter-user variability (distinguishing one user from another) from intra-user consistent preferences (separating what a given user likes from what they dislike). We further design preference-guided visual question answering pretraining and a maximum mean discrepancy (MMD)-based cross-modal alignment loss to enable joint optimization over text, image, and preference features. The framework is implemented as an end-to-end pipeline that couples a multimodal large language model with a diffusion model. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in both image fidelity and preference consistency, validating the effectiveness of our preference disentanglement and cross-modal alignment strategies.
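The summary does not specify how the dual-discriminative probing module is implemented; the PyTorch sketch below is only an illustration under simple assumptions. It uses two linear probe heads over a preference embedding: an inter-user head that classifies which user an embedding belongs to, and an intra-user head that classifies a (user, image) pair as liked or disliked. The names `DualDiscriminativeProbe` and `probing_loss`, the linear heads, and the equal weighting of the two losses are assumptions, not details from the paper.

```python
# Illustrative sketch (PyTorch). The probe architectures and loss weights are
# assumptions; the paper only states that inter-user and intra-user
# discrimination objectives are used.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualDiscriminativeProbe(nn.Module):
    """Two probing heads over a user-preference embedding z:
    - inter-user head: which of `num_users` users produced this embedding,
    - intra-user head: whether a candidate image is liked (1) or disliked (0).
    """

    def __init__(self, dim: int, num_users: int):
        super().__init__()
        self.inter_user_head = nn.Linear(dim, num_users)   # user-identity classifier
        self.intra_user_head = nn.Linear(2 * dim, 1)        # like/dislike classifier

    def forward(self, z_user, z_image):
        user_logits = self.inter_user_head(z_user)                  # (B, num_users)
        pair = torch.cat([z_user, z_image], dim=-1)                 # (B, 2*dim)
        like_logits = self.intra_user_head(pair).squeeze(-1)        # (B,)
        return user_logits, like_logits


def probing_loss(probe, z_user, z_image, user_ids, liked):
    """Joint objective: inter-user cross-entropy + intra-user binary cross-entropy."""
    user_logits, like_logits = probe(z_user, z_image)
    inter = F.cross_entropy(user_logits, user_ids)
    intra = F.binary_cross_entropy_with_logits(like_logits, liked.float())
    return inter + intra
```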
📝 Abstract
Preference-conditioned image generation seeks to adapt generative models to individual users, producing outputs that reflect personal aesthetic choices beyond the given textual prompt. Despite recent progress, existing approaches either fail to capture nuanced user preferences or lack effective mechanisms to encode personalized visual signals. In this work, we propose a multimodal framework that leverages multimodal large language models (MLLMs) to extract rich user representations and inject them into diffusion-based image generation. We train the MLLM with a preference-oriented visual question answering task to capture fine-grained semantic cues. To isolate preference-relevant features, we introduce two complementary probing tasks: inter-user discrimination to distinguish between different users, and intra-user discrimination to separate liked from disliked content. To ensure compatibility with diffusion text encoders, we design a maximum mean discrepancy-based alignment loss that bridges the modality gap while preserving multimodal structure. The resulting embeddings are used to condition the generator, enabling faithful adherence to both prompts and user preferences. Extensive experiments demonstrate that our method substantially outperforms strong baselines in both image quality and preference alignment, highlighting the effectiveness of representation extraction and alignment for personalized generation.
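As a rough illustration of the maximum mean discrepancy (MMD) alignment objective mentioned above, the sketch below computes a biased squared-MMD estimate between MLLM preference embeddings and diffusion text-encoder embeddings using a single Gaussian (RBF) kernel. The kernel choice, the bandwidth `sigma`, and the function name are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of an RBF-kernel MMD alignment loss; the paper's exact
# kernel family and bandwidth schedule are not given here.
import torch


def mmd_alignment_loss(mllm_emb: torch.Tensor, text_emb: torch.Tensor, sigma: float = 1.0):
    """Biased squared-MMD estimate between MLLM preference embeddings (N, D)
    and diffusion text-encoder embeddings (M, D)."""

    def rbf(a, b):
        # Pairwise squared Euclidean distances -> Gaussian kernel matrix.
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))

    k_xx = rbf(mllm_emb, mllm_emb).mean()
    k_yy = rbf(text_emb, text_emb).mean()
    k_xy = rbf(mllm_emb, text_emb).mean()
    return k_xx + k_yy - 2 * k_xy
```

Minimizing this quantity pulls the two embedding distributions together, which is one standard way to bridge a modality gap while leaving the within-distribution structure of each embedding space intact.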