TP‑Blend: Textual‑Prompt Attention Pairing for Precise Object‑Style Blending in Diffusion Models

📅 2026-01-12
🏛️ Trans. Mach. Learn. Res.
📈 Citations: 4
Influential: 0
🤖 AI Summary
This work proposes TP-Blend, a lightweight, training-free framework that addresses a longstanding challenge in text-guided, diffusion-based image editing: simultaneously and precisely injecting both a novel object and a new style. TP-Blend achieves decoupled control by combining Cross-Attention Object Fusion (CAOF) and Self-Attention Style Fusion (SASF) within a single denoising step, enabling concurrent manipulation of content and texture. The method further improves fidelity through entropy-regularized optimal transport, detail-sensitive instance normalization, and high–low frequency decomposition, preserving multi-head feature correlations while boosting fine-grained detail retention. Extensive experiments demonstrate that TP-Blend significantly outperforms existing approaches in high-resolution image editing, achieving state-of-the-art content fidelity, perceptual quality, and inference speed.
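The entropy-regularized optimal transport step in CAOF can be approximated with standard Sinkhorn iterations. The sketch below is illustrative only and not the paper's implementation: `sinkhorn`, `caof_reassign`, the uniform marginals, and the blend weight `alpha` are all assumed names and choices; the key point it demonstrates is that complete feature vectors (the full head-concatenated dimension, e.g. 640 in SD-XL) are transported as units, so cross-head correlations survive.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=50):
    """Entropy-regularised OT via Sinkhorn iterations (illustrative sketch).

    cost: (n, m) pairwise cost between base and object-prompt positions.
    Returns a soft transport plan with (assumed) uniform marginals.
    """
    n, m = cost.shape
    K = np.exp(-cost / eps)          # Gibbs kernel
    a = np.full(n, 1.0 / n)          # assumed uniform source marginal
    b = np.full(m, 1.0 / m)          # assumed uniform target marginal
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def caof_reassign(base_feats, obj_feats, plan, alpha=0.8):
    """Reassign full multi-head feature vectors via the transport plan.

    Each base position receives a barycentric mix of object features;
    `alpha` (an assumption, not from the paper) blends it with the original.
    """
    weights = plan / plan.sum(axis=1, keepdims=True)
    return (1 - alpha) * base_feats + alpha * (weights @ obj_feats)
```

Because rows of `weights` are convex combinations, each updated position stays inside the convex hull of the object features, which is one way such a reassignment can avoid introducing out-of-distribution activations.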

📝 Abstract
Current text-conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight training-free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors. Cross-Attention Object Fusion (CAOF) first averages head-wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy-regularised optimal transport problem that reassigns complete multi-head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD-XL), preserving rich cross-head correlations while keeping memory low. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization. A lightweight one-dimensional Gaussian filter separates low- and high-frequency components; only the high-frequency residual is blended back, imprinting brush-stroke-level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context-aware texture modulation that remains independent of object fusion. Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.
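The Detail-Sensitive Instance Normalization described for SASF can be sketched as a frequency-split AdaIN-style transfer: blur with a 1-D Gaussian to get the low band, restyle only that band toward the style statistics, then add back the high-frequency residual. Everything below is a minimal numpy illustration under assumptions, not the paper's code; `sigma`, `hf_weight`, and the function names are hypothetical, and real SASF operates on self-attention features inside the U-Net rather than on a standalone array.

```python
import numpy as np

def gaussian_kernel1d(sigma=2.0, radius=4):
    """Small 1-D Gaussian kernel, normalised to sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def split_frequencies(feat, sigma=2.0):
    """Split a (tokens, channels) map into low + high bands along tokens."""
    k = gaussian_kernel1d(sigma)
    low = np.apply_along_axis(
        lambda c: np.convolve(c, k, mode="same"), 0, feat)
    return low, feat - low          # exact decomposition: low + high == feat

def detail_sensitive_in(content, style, sigma=2.0, hf_weight=1.0):
    """Restyle the low band toward style statistics, keep the high band.

    The AdaIN-style moment matching on the low band is an assumed
    instantiation of the paper's detail-sensitive instance normalization.
    """
    c_low, c_high = split_frequencies(content, sigma)
    mu_c, std_c = content.mean(0), content.std(0) + 1e-6
    mu_s, std_s = style.mean(0), style.std(0) + 1e-6
    stylised = (c_low - mu_c) / std_c * std_s + mu_s
    return stylised + hf_weight * c_high   # re-inject fine detail
```

The same intuition extends to the Key/Value swap: global appearance statistics come from the style branch, while the content's high-frequency residual (edges, brush-stroke-level texture positions) is preserved verbatim.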
Problem

Research questions and friction points this paper is trying to address.

object-style blending
text-conditioned diffusion models
simultaneous object and style editing
precise content-appearance control
Innovation

Methods, ideas, or system contributions that make the work stand out.

TP-Blend
Cross-Attention Object Fusion
Self-Attention Style Fusion
optimal transport
Detail-Sensitive Instance Normalization