🤖 AI Summary
Text-to-image diffusion models often struggle to accurately encode spatial relationships specified in prompts, primarily due to the absence of fine-grained spatial supervision in training data and the limited capacity of text embeddings to represent spatial semantics. This paper proposes a training-free, inference-time spatial alignment method that dynamically recalibrates noise during denoising using multi-level cross-attention maps, jointly optimizing object localization accuracy and existence completeness. Key contributions include: (1) the first plug-and-play, fine-tuning-free, backbone-agnostic spatial alignment mechanism; and (2) a composite attention loss integrating multi-scale spatial constraints. Evaluated on VISOR and T2I-CompBench, our method achieves new state-of-the-art performance—significantly outperforming existing inference-time approaches and surpassing leading fine-tuning-based methods.
📝 Abstract
Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: the lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss at every denoising step. The proposed loss leverages cross-attention maps extracted from different levels of the backbone decoder to enforce accurate object placement and balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes, to the best of our knowledge, a new state of the art, achieving substantial gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
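To make the idea of a compound attention loss concrete, here is a minimal illustrative sketch. It is not the paper's actual loss: the term forms, the weights `lam_loc`/`lam_exist`, and the presence threshold `tau` are assumptions for illustration. It combines a localization term (the fraction of an object's cross-attention mass falling inside a target region) with an existence term (penalizing an object whose total attention mass is too low), averaged over attention maps from several decoder levels. In a real pipeline this scalar would be differentiated with respect to the latent noise to recalibrate it at each denoising step.

```python
def spatial_loss(attn_maps, box, lam_loc=1.0, lam_exist=1.0, tau=0.2):
    """Toy compound spatial-alignment loss (illustrative, not the paper's).

    attn_maps: list of HxW cross-attention maps (nested lists) for one
               object token, taken at several decoder resolutions.
    box:       (x0, y0, x1, y1) target region in normalized [0, 1] coords.
    """
    total = 0.0
    for A in attn_maps:
        H, W = len(A), len(A[0])
        # Convert the normalized box to pixel indices at this resolution.
        x0, y0 = int(box[0] * W), int(box[1] * H)
        x1, y1 = int(box[2] * W), int(box[3] * H)
        mass_in = sum(A[y][x] for y in range(y0, y1) for x in range(x0, x1))
        mass_all = sum(sum(row) for row in A) + 1e-8
        # Localization: push attention mass inside the target region.
        loc = 1.0 - mass_in / mass_all
        # Existence: penalize objects whose mean attention falls below tau,
        # so no object is dropped from the image entirely.
        exist = max(0.0, tau - mass_all / (H * W))
        total += lam_loc * loc + lam_exist * exist
    # Average across decoder levels (multi-scale constraint).
    return total / len(attn_maps)
```

For example, an attention map concentrated inside the target box yields a near-zero loss, while the same mass placed outside the box yields a loss near 1, so gradient steps on the noise would pull attention toward the intended region.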