Delta-K: Boosting Multi-Instance Generation via Cross-Attention Augmentation

📅 2026-03-10
🤖 AI Summary
Diffusion models often fail to realize every specified concept in multi-instance generation. To address this limitation, this work proposes Delta-K, a backbone-agnostic, plug-and-play inference framework that, for the first time, injects semantic discrepancy signals (denoted ΔK and extracted by a vision-language model) into the shared cross-attention key space. Coupled with a dynamic scheduling mechanism, Delta-K strengthens the semantic representation of missing concepts during the early diffusion steps. The approach requires no architectural modifications, additional training, or spatial masks, yet improves compositional alignment and instance completeness in multi-instance image generation across both DiT and U-Net backbones.

📝 Abstract
While diffusion models excel in text-to-image synthesis, they often suffer from concept omission when synthesizing complex multi-instance scenes. Existing training-free methods attempt to resolve this by rescaling attention maps, which merely exacerbates unstructured noise without establishing coherent semantic representations. To address this, we propose Delta-K, a backbone-agnostic and plug-and-play inference framework that tackles omission by operating directly in the shared cross-attention key space. Specifically, using a vision-language model, we extract a differential key $\Delta K$ that encodes the semantic signature of missing concepts. This signal is then injected during the early semantic planning stage of the diffusion process. Governed by a dynamically optimized scheduling mechanism, Delta-K grounds diffuse noise into stable structural anchors while preserving existing concepts. Extensive experiments demonstrate the generality of our approach: Delta-K consistently improves compositional alignment across both modern DiT models and classical U-Net architectures, without requiring spatial masks, additional training, or architectural modifications.
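To make the idea of key-space injection concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the injection formula, so the additive form K' = K + α(t)·ΔK is an assumption, and the linear-decay `alpha_schedule` is only a hypothetical stand-in for the paper's dynamically optimized schedule. The vision-language-model step that produces ΔK is not modeled; `delta_K` is simply passed in.

```python
import numpy as np

def alpha_schedule(step, num_steps, early_frac=0.3, alpha_max=1.0):
    """Hypothetical schedule: decay linearly to zero over the early steps."""
    cutoff = early_frac * num_steps
    if step >= cutoff:
        return 0.0  # no injection once the early planning stage is over
    return alpha_max * (1.0 - step / cutoff)

def cross_attention(Q, K, V):
    """Plain scaled dot-product cross-attention; returns (output, weights)."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

def augmented_cross_attention(Q, K, V, delta_K, step, num_steps):
    """Assumed additive injection: attend with K + alpha(step) * delta_K."""
    alpha = alpha_schedule(step, num_steps)
    return cross_attention(Q, K + alpha * delta_K, V)

# Toy example: 4 image-token queries, 3 text-token keys, feature size 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(3, 8))
V = rng.normal(size=(3, 8))

# Pretend text token 2 is the omitted concept: only its key row is shifted.
delta_K = np.zeros_like(K)
delta_K[2] = rng.normal(size=8)

early_out, early_w = augmented_cross_attention(Q, K, V, delta_K, step=0, num_steps=50)
late_out, late_w = augmented_cross_attention(Q, K, V, delta_K, step=40, num_steps=50)
```

Under these assumptions the late-step call reduces to ordinary cross-attention, because the schedule returns zero there, so concepts that are already present are left untouched once the early stage has passed.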
Problem

Research questions and friction points this paper is trying to address.

concept omission
multi-instance generation
diffusion models
text-to-image synthesis
cross-attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Delta-K
cross-attention augmentation
multi-instance generation
diffusion models
concept omission
Zitong Wang
School of Software Engineering, Sun Yat-sen University
Zijun Shen
Nanjing University
Haohao Xu
College of Management and Economics, Tianjin University
Zhengjie Luo
School of Software Engineering, Sun Yat-sen University
Weibin Wu
Sun Yat-sen University
Trustworthy Machine Learning