Nested Attention: Semantic-aware Attention Values for Concept Personalization

πŸ“… 2025-01-02
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current text-to-image models struggle to simultaneously preserve subject identity fidelity and maintain text alignment in personalized subject generation: single-token embeddings lack sufficient representational capacity, while aggressive fine-tuning disrupts pretrained priors, causing semantic drift. To address this, we propose a query-dependent nested attention mechanism that dynamically injects subject features into cross-attention layers, enabling region-adaptive feature selection. Our approach combines lightweight encoder-side fine-tuning with semantic-aware attention-value modeling to preserve pretrained knowledge. Notably, it is the first method to enable multi-subject cross-domain co-occurrence (e.g., β€œan astronaut riding a panda”). Quantitatively, it achieves significant improvements in identity consistency (+23.6% ID retention rate) and prompt adherence (CLIP-Score +18.4%), while demonstrating strong generalization across diverse styles and scenes.

Technology Category

Application Category

πŸ“ Abstract
Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model's prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model's existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.
Problem

Research questions and friction points this paper is trying to address.

Generative Models
Object Characteristics
Text-to-Image Discrepancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Nested Attention
Multi-level Feature Selection
Cross-domain Customization
πŸ”Ž Similar Papers
No similar papers found.