🤖 AI Summary
Existing diffusion models struggle to simultaneously preserve subject identity and ensure precise text-semantic alignment in multimodal-guided custom subject insertion. This paper proposes a zero-shot, fine-tuning-free cross-modal in-context learning framework: given only a user-provided object image and a textual prompt as joint exemplars, it guides a pretrained MMDiT inpainting network to perform semantically coherent, high-fidelity reconstruction within masked regions. Our core innovation lies in dual-level latent-space manipulation: intra-head feature shifting to enhance identity consistency, and inter-head attention reweighting to improve text controllability. The method requires no training or auxiliary data and is fully plug-and-play. Extensive experiments demonstrate that our approach surpasses current state-of-the-art methods in subject identity fidelity, text-alignment accuracy, and overall image quality.
📝 Abstract
Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly "brushes" user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and to align results with the user's intent expressed through textual prompts. In this work, we propose "In-Context Brush", a zero-shot framework for customized subject insertion that reformulates the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with the subject, in alignment with the textual prompts, without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head "latent feature shifting" within each attention head, which dynamically shifts attention outputs to reflect the desired subject semantics, and inter-head "attention reweighting" across different heads, which amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection.
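The two test-time manipulations can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the function names, shapes, the `subject_direction` vector, and the per-head weights are all assumptions standing in for quantities the paper derives from the cross-modal demonstrations.

```python
import numpy as np

def latent_feature_shift(head_outputs, subject_direction, alpha=0.3):
    """Intra-head 'latent feature shifting' (sketch): nudge each attention
    head's output along a direction assumed to encode the subject semantics.
    head_outputs: (num_heads, tokens, dim); subject_direction: (dim,)."""
    direction = subject_direction / (np.linalg.norm(subject_direction) + 1e-8)
    return head_outputs + alpha * direction  # broadcasts over heads and tokens

def attention_reweight(head_outputs, head_weights):
    """Inter-head 'attention reweighting' (sketch): scale heads differentially,
    amplifying those assumed to carry prompt-relevant information."""
    w = np.asarray(head_weights, dtype=head_outputs.dtype)
    return head_outputs * w[:, None, None]

rng = np.random.default_rng(0)
heads = rng.standard_normal((4, 6, 8))    # toy: 4 heads, 6 tokens, dim 8
subject_dir = rng.standard_normal(8)      # stand-in for a subject feature
shifted = latent_feature_shift(heads, subject_dir, alpha=0.3)
reweighted = attention_reweight(shifted, [1.5, 1.0, 1.0, 0.5])
print(reweighted.shape)                   # (4, 6, 8)
```

In an actual MMDiT pipeline these operations would be applied inside the attention blocks at inference time (e.g. via attention-processor hooks), leaving all pretrained weights untouched, which is what makes the method tuning-free.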