🤖 AI Summary
Diffusion models often fail to fully realize all specified concepts in multi-instance generation scenarios. To address this limitation, this work proposes Delta-K, a backbone-agnostic, plug-and-play inference framework that is, to the authors' knowledge, the first to inject semantic discrepancy signals (denoted ΔK and extracted by a vision-language model) into the shared cross-attention key space. Coupled with a dynamic scheduling mechanism, Delta-K strengthens the semantic representation of missing concepts during the early diffusion steps. Notably, the approach requires no architectural modifications, additional training, or spatial masks, yet significantly improves compositional alignment and instance completeness in multi-instance image generation across diverse backbones such as DiT and U-Net.
📝 Abstract
While diffusion models excel at text-to-image synthesis, they often suffer from concept omission when synthesizing complex multi-instance scenes. Existing training-free methods attempt to resolve this by rescaling attention maps, which merely amplifies unstructured noise without establishing coherent semantic representations. To address this, we propose Delta-K, a backbone-agnostic, plug-and-play inference framework that tackles omission by operating directly in the shared cross-attention Key space. Specifically, using a vision-language model, we extract a differential key $\Delta K$ that encodes the semantic signature of the missing concepts. This signal is then injected during the early semantic-planning stage of the diffusion process. Governed by a dynamically optimized scheduling mechanism, Delta-K grounds diffuse noise into stable structural anchors while preserving existing concepts. Extensive experiments demonstrate the generality of our approach: Delta-K consistently improves compositional alignment across both modern DiT models and classical U-Net architectures, without requiring spatial masks, additional training, or architectural modifications.
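The core mechanism described in the abstract can be pictured as a scheduled additive update to the cross-attention keys. The sketch below is illustrative only: the schedule shape, cutoff, and function names (`delta_k_schedule`, `inject_delta_k`, `t_cutoff`, `strength`) are assumptions, and $\Delta K$ is treated as a given array rather than being extracted by a vision-language model as in the paper.

```python
import numpy as np

def delta_k_schedule(t, T, t_cutoff=0.3):
    # Hypothetical linearly decaying schedule: the injection is active only
    # during the early "semantic planning" steps (t/T < t_cutoff) and fades
    # to zero afterwards, so late denoising steps are left untouched.
    frac = t / T
    if frac >= t_cutoff:
        return 0.0
    return 1.0 - frac / t_cutoff

def inject_delta_k(K, delta_K, t, T, strength=1.0):
    # K:        (num_tokens, d_k) cross-attention keys for the text prompt.
    # delta_K:  differential key of the same shape, encoding the semantic
    #           signature of missing concepts (per the paper, derived with
    #           a vision-language model; here supplied directly).
    lam = strength * delta_k_schedule(t, T)
    return K + lam * delta_K
```

Because the update is a pure addition in the shared Key space, it slots into any backbone's cross-attention layer (DiT or U-Net) without masks, retraining, or architectural changes, consistent with the plug-and-play claim.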