🤖 AI Summary
Existing text-to-image personalization methods struggle to simultaneously achieve concept fidelity and semantic consistency under specific prompts and noise seeds. To address this, we propose a novel query-level fine-grained concept learning framework that jointly optimizes self-attention and cross-attention via a dual-loss mechanism, guided by Prompt-Diffusion Matching (PDM) features to explicitly model identity characteristics of novel visual concepts. Our method is architecture-agnostic—compatible with both UNet and DiT backbones—and supports end-to-end diffusion model fine-tuning. We conduct comprehensive evaluations across six state-of-the-art baselines and multiple foundational models. Results demonstrate significant improvements over existing per-query personalization approaches in generation quality, concept accuracy, and cross-prompt generalization. This advancement enables more robust and controllable generation for applications such as personalized design and product embedding.
📝 Abstract
Visual concept learning, also known as text-to-image personalization, is the process of teaching new concepts to a pretrained model. This has numerous applications, from product placement to entertainment and personalized design. Here we show that many existing methods can be substantially augmented by adding a personalization step that is (1) specific to the prompt and noise seed, and (2) guided by two loss terms based on the self- and cross-attention maps, capturing the identity of the personalized concept. Specifically, we leverage PDM features, previously designed to capture identity, and show how they can be used to improve personalized semantic similarity. We evaluate the benefit our method provides on top of six different personalization methods and several base text-to-image models (both UNet- and DiT-based). We find significant improvements even over previous per-query personalization methods.
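The abstract describes combining two attention-based loss terms, one on self-attention (identity, via PDM-style features) and one on cross-attention (prompt alignment). A minimal sketch of how such a combined per-query objective might look is below; the function names, the choice of mean squared error, and the `lambda_self`/`lambda_cross` weights are all illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def attention_map_loss(pred, target):
    # Hypothetical stand-in for an attention-based identity loss:
    # mean squared error between a generated attention map and a
    # reference map extracted from the personalized concept.
    return float(np.mean((pred - target) ** 2))

def per_query_loss(self_attn, self_attn_ref, cross_attn, cross_attn_ref,
                   lambda_self=1.0, lambda_cross=1.0):
    # Combine the two loss terms mentioned in the abstract:
    # a self-attention term (concept identity) and a
    # cross-attention term (prompt/semantic alignment),
    # weighted by assumed hyperparameters lambda_self / lambda_cross.
    return (lambda_self * attention_map_loss(self_attn, self_attn_ref)
            + lambda_cross * attention_map_loss(cross_attn, cross_attn_ref))
```

In an actual per-query setup, this objective would be minimized for a fixed prompt and noise seed, backpropagating through the diffusion model's attention layers; the sketch above only shows the shape of the combined loss.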