DynaIP: Dynamic Image Prompt Adapter for Scalable Zero-shot Personalized Text-to-Image Generation

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing personalized text-to-image (T2I) methods face three key challenges: trade-offs between concept preservation (CP) and prompt following (PF), loss of fine-grained visual details, and poor scalability to multi-subject generation. This paper introduces DynaIP—a zero-shot, fine-tuning-free dynamic image prompt adapter. Its core contributions are: (1) a dynamic decoupling strategy that explicitly suppresses concept-irrelevant interference, balancing CP and PF; (2) a Mixture-of-Experts (MoE)-based weighted fusion module leveraging hierarchical CLIP features to enhance fine-grained fidelity and multi-subject scalability; and (3) a cross-attention injection mechanism compatible with MM-DiT. Evaluated on single- and multi-subject benchmarks, DynaIP outperforms state-of-the-art methods across all metrics—significantly improving concept consistency, prompt adherence, and detail reconstruction—while enabling flexible, controllable compositional generation of multiple subjects.

📝 Abstract
Personalized Text-to-Image (PT2I) generation aims to produce customized images based on reference images. A prominent line of work integrates an image prompt adapter to enable zero-shot PT2I without test-time fine-tuning. However, current methods grapple with three fundamental challenges: (1) the elusive equilibrium between Concept Preservation (CP) and Prompt Following (PF); (2) the difficulty of retaining the fine-grained concept details of reference images; and (3) the limited scalability to multi-subject personalization. To tackle these challenges, we present the Dynamic Image Prompt Adapter (DynaIP), a plugin that enhances the fine-grained concept fidelity, CP-PF balance, and subject scalability of state-of-the-art T2I multimodal diffusion transformers (MM-DiT) for PT2I generation. Our key finding is that MM-DiT inherently exhibits a decoupled learning behavior when reference image features are injected into its dual branches via cross-attention. Building on this, we design an innovative Dynamic Decoupling Strategy that removes the interference of concept-agnostic information during inference, significantly improving the CP-PF balance and further bolstering the scalability of multi-subject composition. Moreover, we identify the visual encoder as a key factor in fine-grained CP and show that the hierarchical features of the commonly used CLIP encoder capture visual information at diverse granularity levels. We therefore introduce a novel Hierarchical Mixture-of-Experts Feature Fusion Module that fully exploits these hierarchical features, markedly elevating fine-grained concept fidelity while also providing flexible control over visual granularity. Extensive experiments on single- and multi-subject PT2I tasks verify that DynaIP outperforms existing approaches, marking a notable advance in the field of PT2I generation.
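The Hierarchical Mixture-of-Experts Feature Fusion Module described above can be pictured as a learned, per-layer weighting over CLIP's intermediate hidden states. The sketch below is an illustrative reconstruction under stated assumptions, not the paper's implementation: each "expert" is a projection applied to one CLIP layer's token features, and a gate predicts per-layer weights so coarse or fine granularity can be emphasized. All class and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn


class HierarchicalMoEFusion(nn.Module):
    """Minimal sketch of MoE-style fusion over hierarchical CLIP features.

    Assumptions (not from the paper): one linear expert per CLIP layer,
    and a gate that pools every layer's tokens to score the layers.
    """

    def __init__(self, num_layers: int, clip_dim: int, out_dim: int):
        super().__init__()
        # One lightweight expert per CLIP hidden-state layer.
        self.experts = nn.ModuleList(
            nn.Linear(clip_dim, out_dim) for _ in range(num_layers)
        )
        # Gate maps concatenated pooled features to one weight per layer.
        self.gate = nn.Linear(num_layers * clip_dim, num_layers)

    def forward(self, hidden_states: list) -> torch.Tensor:
        # hidden_states: list of (batch, tokens, clip_dim), one per layer.
        pooled = torch.cat([h.mean(dim=1) for h in hidden_states], dim=-1)
        weights = torch.softmax(self.gate(pooled), dim=-1)  # (batch, layers)
        # Weighted sum of per-layer expert outputs (broadcast over tokens).
        return sum(
            w.unsqueeze(-1).unsqueeze(-1) * expert(h)
            for w, expert, h in zip(
                weights.unbind(dim=-1), self.experts, hidden_states
            )
        )
```

Because the gate output is a softmax, the fused representation stays a convex combination of the expert outputs, which is one plausible way to get the "flexible control of visual granularity" the abstract mentions: biasing the weights toward shallow or deep CLIP layers shifts the result between fine texture and high-level semantics.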
Problem

Research questions and friction points this paper is trying to address.

Balancing concept preservation and prompt following in zero-shot PT2I
Retaining fine-grained details from reference images in generation
Enhancing scalability for multi-subject personalized image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Decoupling Strategy removes concept-agnostic interference
Hierarchical Mixture-of-Experts fuses CLIP features for fine-grained fidelity
Plugin enhances MM-DiT for multi-subject scalability without fine-tuning
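The cross-attention injection named in the bullets above follows the general image-prompt-adapter pattern: keep the base attention over text tokens and add a parallel attention over reference-image tokens, scaled by a factor that can be reduced at inference. The sketch below is a generic illustration of that pattern, not the paper's MM-DiT-specific mechanism; every name and the use of `nn.MultiheadAttention` are assumptions for demonstration.

```python
import torch
import torch.nn as nn


class ImagePromptCrossAttention(nn.Module):
    """Hypothetical sketch: latent tokens attend separately to text tokens
    and reference-image tokens; the image branch is scaled so its influence
    can be dialed down at inference (suppressing concept-agnostic signal
    trades concept preservation for prompt following)."""

    def __init__(self, dim: int, num_heads: int = 8, scale: float = 1.0):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_image = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.scale = scale  # lowered at inference to favor prompt following

    def forward(self, x, text_tokens, image_tokens):
        # x: latent tokens (batch, n, dim); prompts: (batch, m, dim).
        out_text, _ = self.attn_text(x, text_tokens, text_tokens)
        out_image, _ = self.attn_image(x, image_tokens, image_tokens)
        return out_text + self.scale * out_image
```

In this reading, multi-subject generation would concatenate each subject's image tokens (or run one image branch per subject), which is consistent with the scalability claim, though the paper's exact routing is not specified here.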
Zhizhong Wang
Researcher, Huawei
Generative Models · Multi-Modal · Style Transfer
Tianyi Chu
Central Media Technology Institute, Huawei
Zeyi Huang
Central Media Technology Institute, Huawei
Nanyang Wang
Central Media Technology Institute, Huawei
Kehan Li
Stanford University