InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

📅 2025-12-19

🤖 AI Summary
Text-to-image diffusion models often struggle to render the spatial relationships specified in prompts, primarily because training data lacks fine-grained spatial supervision and text embeddings have limited capacity to represent spatial semantics. This paper proposes InfSplign, a training-free, inference-time spatial alignment method that adjusts the noise at every denoising step using cross-attention maps from multiple levels of the backbone decoder, jointly encouraging accurate object placement and balanced object presence. Key contributions include: (1) a plug-and-play, fine-tuning-free, backbone-agnostic spatial alignment mechanism; and (2) a compound attention loss that combines spatial constraints across attention levels. Evaluated on VISOR and T2I-CompBench, InfSplign reports new state-of-the-art performance, with substantial gains over the strongest existing inference-time baselines and better results than fine-tuning-based methods.

📝 Abstract
Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss in every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and a balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state of the art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
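
The abstract does not spell out the loss, so the following is only a minimal sketch of what a compound attention loss of this kind could look like. It assumes each object token has one cross-attention map per decoder level, with values normalised to [0, 1]. The centroid-based placement term, the peak-based presence term, the `margin` and `lam` parameters, and all function names are illustrative assumptions, not the paper's actual formulation.

```python
from typing import List, Sequence, Tuple

Map = List[List[float]]  # one cross-attention map (rows x cols), values in [0, 1]


def centroid(attn: Map) -> Tuple[float, float]:
    """Attention-weighted centre (x, y), each coordinate normalised to [0, 1]."""
    h, w = len(attn), len(attn[0])
    total = sum(sum(row) for row in attn) or 1e-8  # guard against an all-zero map
    cx = sum(v * (x + 0.5) / w for row in attn for x, v in enumerate(row)) / total
    cy = sum(v * (y + 0.5) / h for y, row in enumerate(attn) for v in row) / total
    return cx, cy


def placement_loss(attn_a: Map, attn_b: Map, relation: str, margin: float = 0.25) -> float:
    """Hinge penalty that is zero once 'A <relation> B' holds by at least `margin`."""
    ax, ay = centroid(attn_a)
    bx, by = centroid(attn_b)
    gaps = {
        "left of": bx - ax,
        "right of": ax - bx,
        "above": by - ay,  # image y grows downward
        "below": ay - by,
    }
    return max(0.0, margin - gaps[relation])


def presence_loss(attn_a: Map, attn_b: Map) -> float:
    """Penalise weak or unbalanced peak attention so neither object vanishes."""
    peak_a = max(max(row) for row in attn_a)
    peak_b = max(max(row) for row in attn_b)
    return (1.0 - peak_a) + (1.0 - peak_b) + abs(peak_a - peak_b)


def compound_loss(levels: Sequence[Tuple[Map, Map]], relation: str, lam: float = 1.0) -> float:
    """Average placement + lam * presence over (map_a, map_b) pairs, one per level."""
    per_level = [
        placement_loss(a, b, relation) + lam * presence_loss(a, b) for a, b in levels
    ]
    return sum(per_level) / len(per_level)
```

Averaging over levels is one simple way to combine maps of different resolutions; because centroids are normalised, coarse and fine maps contribute on the same scale.
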
Problem

Research questions and friction points this paper is trying to address.

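Text-to-image diffusion models often fail to render the spatial relations specified in prompts.
Training data lacks fine-grained spatial supervision, and text embeddings encode spatial semantics poorly.
Spatial alignment should improve without retraining or fine-tuning the model.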
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free inference-time spatial alignment method
Adjusts noise via compound loss in denoising steps
Uses cross-attention maps for object placement and presence
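
The points above can be illustrated with a toy guidance loop. This is a sketch under strong assumptions rather than the paper's sampler: the latent is a short list of floats, `predict_noise` stands in for the diffusion backbone, `loss_fn` stands in for a loss computed from cross-attention maps, and the gradient is estimated by central finite differences instead of backpropagation through the network. The update rule `z - step_size * eps` is a deliberately simplified placeholder for a real scheduler step, and all names here are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def numerical_grad(loss_fn: Callable[[Vector], float], z: Vector, h: float = 1e-4) -> Vector:
    """Central finite-difference gradient of loss_fn at z (stand-in for autograd)."""
    grad = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + h] + z[i + 1:]
        down = z[:i] + [z[i] - h] + z[i + 1:]
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * h))
    return grad


def adjust_noise(eps: Vector, z: Vector, loss_fn: Callable[[Vector], float], scale: float) -> Vector:
    """Shift the predicted noise along the loss gradient, so that removing
    the adjusted noise also moves the latent toward lower loss."""
    grad = numerical_grad(loss_fn, z)
    return [e + scale * g for e, g in zip(eps, grad)]


def guided_sampling(
    z: Vector,
    predict_noise: Callable[[Vector, int], Vector],
    loss_fn: Callable[[Vector], float],
    steps: int,
    scale: float,
    step_size: float,
) -> Vector:
    """Apply the loss-based noise adjustment at every denoising step."""
    for t in reversed(range(steps)):
        eps = predict_noise(z, t)                    # backbone's noise estimate
        eps = adjust_noise(eps, z, loss_fn, scale)   # inference-time correction
        z = [zi - step_size * ei for zi, ei in zip(z, eps)]  # simplified update
    return z
```

Because only the predicted noise is modified, the backbone's weights stay untouched, which is what makes this style of guidance training-free; in practice the gradient would come from automatic differentiation through the model's attention layers.
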