🤖 AI Summary
Existing diffusion models struggle to simultaneously preserve subject identity and ensure precise text-semantic alignment in multimodal-guided custom subject insertion. This paper proposes a zero-shot, fine-tuning-free cross-modal in-context learning framework: given only a user-provided object image and a textual prompt as joint exemplars, it guides a pretrained MMDiT inpainting network to perform semantically coherent, high-fidelity reconstruction within masked regions. Our core innovation lies in dual-level latent-space manipulation: intra-head feature shifting to enhance identity consistency, and inter-head attention reweighting to improve text controllability. The method requires no training or auxiliary data and is fully plug-and-play. Extensive experiments demonstrate that our approach surpasses current state-of-the-art methods in subject identity fidelity, text-alignment accuracy, and overall image quality.
📝 Abstract
Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly "brushes" user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and to align results with the user's intent expressed through textual prompts. In this work, we propose "In-Context Brush", a zero-shot framework for customized subject insertion that reformulates the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with the subject, in alignment with the textual prompts, without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head "latent feature shifting" within each attention head, which dynamically shifts attention outputs to reflect the desired subject semantics, and inter-head "attention reweighting" across different heads, which amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection.
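The two test-time manipulations can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the function names, shapes, the `subject_direction` vector, and the per-head weights are all assumptions standing in for quantities the paper derives from the cross-modal demonstrations.

```python
import numpy as np

def latent_feature_shift(head_outputs, subject_direction, alpha=0.3):
    """Intra-head 'latent feature shifting' (sketch): nudge each attention
    head's output along a direction assumed to encode the subject semantics.
    head_outputs: (num_heads, tokens, dim); subject_direction: (dim,)."""
    direction = subject_direction / (np.linalg.norm(subject_direction) + 1e-8)
    return head_outputs + alpha * direction  # broadcasts over heads and tokens

def attention_reweight(head_outputs, head_weights):
    """Inter-head 'attention reweighting' (sketch): scale heads differentially,
    amplifying those assumed to carry prompt-relevant information."""
    w = np.asarray(head_weights, dtype=head_outputs.dtype)
    return head_outputs * w[:, None, None]

rng = np.random.default_rng(0)
heads = rng.standard_normal((4, 6, 8))    # toy: 4 heads, 6 tokens, dim 8
subject_dir = rng.standard_normal(8)      # stand-in for a subject feature
shifted = latent_feature_shift(heads, subject_dir, alpha=0.3)
reweighted = attention_reweight(shifted, [1.5, 1.0, 1.0, 0.5])
print(reweighted.shape)                   # (4, 6, 8)
```

In an actual MMDiT pipeline these operations would be applied inside the attention blocks at inference time (e.g. via attention-processor hooks), leaving all pretrained weights untouched, which is what makes the method tuning-free.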