AI Summary
This work addresses the challenge of accurately modeling users' fine-grained preferences in text-to-image generation. To this end, we propose Premier, a framework that learns a user-specific embedding to capture each user's preferences and introduces a preference adapter that fuses this embedding with the text prompt; the fused embedding then modulates the generation process to give fine-grained control. A dispersion loss enforces separation among the learned embeddings to make them more discriminative, and new users with little data are represented as linear combinations of existing embeddings, enabling few-shot generalization. Experiments show that, given the same amount of user history, Premier outperforms existing methods in preference alignment, text-image consistency, ViPer proxy metrics, and expert evaluations.
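The dispersion loss can be illustrated with a minimal Python sketch. The summary does not give its formula, so the form below is an assumption: the log of the mean pairwise similarity exp(-||e_i - e_j||^2 / tau) over all pairs of user embeddings, a repulsive objective with no positive pairs. The names `dispersion_loss`, `squared_distance`, and `tau` are hypothetical.

```python
import math
from itertools import combinations


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dispersion_loss(embeddings, tau=1.0):
    """Hypothetical repulsive loss over user preference embeddings.

    Computes log(mean_{i<j} exp(-||e_i - e_j||^2 / tau)) with a stable
    log-mean-exp. Minimizing it pushes embeddings apart. This is an
    assumed form, not Premier's published loss.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0  # a single user has nothing to be separated from
    logits = [-squared_distance(a, b) / tau for a, b in pairs]
    m = max(logits)  # subtract the max before exponentiating to avoid underflow
    return m + math.log(sum(math.exp(z - m) for z in logits) / len(logits))


if __name__ == "__main__":
    clustered = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]  # toy 2-d embeddings
    spread = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
    print(dispersion_loss(clustered), dispersion_loss(spread))
```

Because every pairwise term is at most 1, the loss is never positive and equals 0 only when all embeddings coincide; spreading the embeddings apart drives it lower, which is the separation effect the summary describes.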
Abstract
Text-to-image generation has advanced rapidly, yet it still struggles to capture nuanced user preferences. Existing approaches typically rely on multimodal large language models to infer user preferences, but the derived prompts or latent codes rarely reflect these preferences faithfully, leading to suboptimal personalization. We present Premier, a novel preference modulation framework for personalized image generation. Premier represents each user's preference as a learnable embedding and introduces a preference adapter that fuses the user embedding with the text prompt. To enable accurate and fine-grained preference control, the fused preference embedding is further used to modulate the generative process. To make individual preferences more distinct and improve alignment between outputs and user-specific styles, we incorporate a dispersion loss that enforces separation among user embeddings. When user data are scarce, new users are represented as linear combinations of the preference embeddings learned during training, enabling effective generalization. Experiments show that Premier outperforms prior methods under the same history length, achieving stronger preference alignment and superior performance on text consistency, ViPer proxy metrics, and expert evaluations.
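The adapter, modulation, and few-shot steps can be sketched in plain Python. The abstract specifies none of the layer forms, so everything here is assumed: the adapter concatenates the user embedding with a single pooled text-prompt vector (a real prompt encoder produces a token sequence) and applies a tanh-activated projection; the modulation is a FiLM/AdaLN-style scale-and-shift of intermediate features; and a new user's mixing weights are a softmax over logits that would be fit on that user's few examples while the training users' embeddings stay frozen. All function and parameter names (`linear`, `preference_adapter`, `modulate`, `softmax`, `combine_user_embeddings`, `W`, `b`, `W_scale`, `b_scale`, `W_shift`, `b_shift`) are hypothetical.

```python
import math


def linear(weights, bias, x):
    """Dense layer: `weights` is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def preference_adapter(user_emb, text_emb, params):
    """Hypothetical adapter: concatenate user and pooled text embeddings, project, tanh."""
    joint = list(user_emb) + list(text_emb)
    return [math.tanh(v) for v in linear(params["W"], params["b"], joint)]


def modulate(features, pref_emb, params):
    """Assumed FiLM/AdaLN-style modulation: features * (1 + scale) + shift."""
    scale = linear(params["W_scale"], params["b_scale"], pref_emb)
    shift = linear(params["W_shift"], params["b_shift"], pref_emb)
    return [h * (1.0 + s) + t for h, s, t in zip(features, scale, shift)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_user_embeddings(basis, logits):
    """Represent a new user as a weighted combination of learned embeddings.

    basis: frozen embeddings of training users. logits: per-user mixing
    scores, assumed to be fit on the new user's few examples.
    """
    weights = softmax(logits)
    dim = len(basis[0])
    return [sum(w * e[d] for w, e in zip(weights, basis)) for d in range(dim)]


if __name__ == "__main__":
    basis = [[1.0, 0.0], [0.0, 1.0]]  # two training users, toy 2-d embeddings
    new_user = combine_user_embeddings(basis, [0.0, 0.0])
    print(new_user)  # equal logits give equal weights: [0.5, 0.5]
    adapter_params = {"W": [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]], "b": [0.0, 0.0]}
    pref = preference_adapter(new_user, [2.0, -1.0], adapter_params)
    zero = {"W_scale": [[0.0, 0.0]] * 2, "b_scale": [0.0, 0.0],
            "W_shift": [[0.0, 0.0]] * 2, "b_shift": [0.0, 0.0]}
    print(modulate([1.0, -2.0], pref, zero))  # zero-initialized modulation is the identity
```

Using a softmax keeps the mixing weights non-negative and summing to one, so the new user's embedding stays inside the convex hull of the learned preferences; the abstract only says "linear combinations", so unconstrained weights would be an equally valid reading. Zero-initializing the scale and shift projections makes the modulation start as the identity, a common choice for adapters added to a pretrained generator.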