In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation

📅 2025-05-26
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing diffusion models struggle to simultaneously preserve subject identity fidelity and ensure precise text-semantic alignment in multimodal-guided customized subject insertion. This paper proposes a zero-shot, fine-tuning-free cross-modal in-context learning framework: given only a user-provided object image and a textual prompt as joint exemplars, it guides a pretrained MMDiT inpainting network to perform semantically coherent, high-fidelity reconstruction within masked regions. The core innovation is a dual-level latent-space manipulation: intra-head feature shifting to enhance identity consistency, and inter-head attention reweighting to improve text controllability. The method requires no training or auxiliary data and is fully plug-and-play. Extensive experiments demonstrate that the approach surpasses current state-of-the-art methods in subject identity fidelity, text-alignment accuracy, and overall image quality.
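
A minimal PyTorch sketch of the dual-level manipulation described above, given here only as an illustration of the idea rather than the authors' implementation. The tensor shapes, the pooled subject_feat, the per-head head_weights, and the blending factor alpha are all assumptions made for the example.

```python
# Hypothetical sketch: dual-level latent manipulation inside one attention block.
import torch

def dual_level_manipulation(attn_out, subject_feat, head_weights, alpha=0.5):
    """
    attn_out:     (B, H, N, D) per-head attention outputs of the inpainting network
    subject_feat: (B, H, 1, D) reference-subject features (assumed here to be pooled
                  per head from the exemplar image's tokens)
    head_weights: (H,) reweighting coefficients emphasizing prompt-relevant heads
    """
    # Intra-head latent feature shifting: nudge each head's output toward the
    # subject semantics to strengthen identity consistency in the masked region.
    shifted = attn_out + alpha * (subject_feat - attn_out.mean(dim=2, keepdim=True))

    # Inter-head attention reweighting: scale heads differentially so that heads
    # most responsive to the textual prompt dominate the merged representation.
    weights = head_weights.view(1, -1, 1, 1)
    return shifted * weights

# Toy usage with random tensors (B=1, H=8, N=64, D=128).
out = dual_level_manipulation(
    torch.randn(1, 8, 64, 128),
    torch.randn(1, 8, 1, 128),
    torch.softmax(torch.randn(8), dim=0) * 8,  # normalized so the mean weight is ~1
)
print(out.shape)  # torch.Size([1, 8, 64, 128])
```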

๐Ÿ“ Abstract
Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly "brushes" user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and to align results with the user's intent through textual prompts. In this work, we propose "In-Context Brush", a zero-shot framework for customized subject insertion, by reformulating the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with the subject aligning with the textual prompts, without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head "latent feature shifting" within each attention head, which dynamically shifts attention outputs to reflect the desired subject semantics, and inter-head "attention reweighting" across different heads, which amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection.
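
To make the in-context formulation concrete, the toy sketch below shows one way the cross-modal demonstration (reference object tokens plus prompt) and the masked target "query" could be assembled before being handed to an MMDiT-style inpainting model at test time. The function name, token shapes, and simple sequence concatenation are assumptions for illustration and do not reflect the paper's actual pipeline.

```python
# Hypothetical sketch: building an in-context query for a masked-region inpainting pass.
import torch

def build_in_context_query(ref_latent, target_latent, mask):
    """
    ref_latent:    (B, N_ref, D) tokens of the user-provided object image (demonstration)
    target_latent: (B, N_tgt, D) tokens of the target image
    mask:          (B, N_tgt, 1) 1 inside the region to be inpainted, 0 elsewhere
    """
    # Blank out the masked region of the target; this is the "query" to be completed.
    query = target_latent * (1.0 - mask)
    # Concatenate demonstration and query tokens along the sequence dimension so the
    # model can attend from the masked region to the reference subject.
    return torch.cat([ref_latent, query], dim=1)

tokens = build_in_context_query(
    torch.randn(1, 256, 64),
    torch.randn(1, 1024, 64),
    (torch.rand(1, 1024, 1) > 0.8).float(),
)
print(tokens.shape)  # torch.Size([1, 1280, 64])
```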
Problem

Research questions and friction points this paper is trying to address.

Insert customized subjects into images with high fidelity
Align inserted subjects with user intent via text prompts
Achieve zero-shot customization without model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot framework for subject insertion
Dual-level latent space manipulation
MMDiT-based inpainting network enhancement
🔎 Similar Papers
No similar papers found.
Yu Xu
University of Cambridge
Multi-omics, Health Data Science, Data Mining, Social Network, Web Services

Fan Tang
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

You Wu
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Lin Gao
Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences

Oliver Deussen
Professor of Computer Science, University of Konstanz
Computer Graphics, Visualization, Modelling

Hongbin Yan
Associate Professor, Nanjing University of Science and Technology
Convective flow and heat transfer, Thermal management and protection

Jintao Li
Institute of Computing Technology, Chinese Academy of Sciences

Juan Cao
Professor of Mathematics, Xiamen University
Computer Aided Geometric Design, Computer Graphics

Tong-Yee Lee
National Cheng-Kung University
Computer graphics, visualization, Virtual Reality, multimedia, AI and Deep learning