🤖 AI Summary
Text-to-image diffusion models often struggle to accurately encode spatial relationships specified in prompts, primarily due to the absence of fine-grained spatial supervision in training data and the limited capacity of text embeddings to represent spatial semantics. This paper proposes a training-free, inference-time spatial alignment method that dynamically recalibrates noise during denoising using multi-level cross-attention maps, jointly optimizing object localization accuracy and existence completeness. Key contributions include: (1) the first plug-and-play, fine-tuning-free, backbone-agnostic spatial alignment mechanism; and (2) a composite attention loss integrating multi-scale spatial constraints. Evaluated on VISOR and T2I-CompBench, our method achieves new state-of-the-art performance—significantly outperforming existing inference-time approaches and surpassing leading fine-tuning-based methods.
📝 Abstract
Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: the lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss at every denoising step. The proposed loss leverages cross-attention maps extracted from different levels of the backbone decoder to enforce accurate object placement and balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes, to the best of our knowledge, a new state of the art, achieving substantial gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
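To make the idea of a compound attention loss concrete, here is a minimal illustrative sketch. It is not the paper's actual loss: the term forms, the weights `lam_loc`/`lam_exist`, and the presence threshold `tau` are assumptions for illustration. It combines a localization term (the fraction of an object's cross-attention mass falling inside a target region) with an existence term (penalizing an object whose total attention mass is too low), averaged over attention maps from several decoder levels. In a real pipeline this scalar would be differentiated with respect to the latent noise to recalibrate it at each denoising step.

```python
def spatial_loss(attn_maps, box, lam_loc=1.0, lam_exist=1.0, tau=0.2):
    """Toy compound spatial-alignment loss (illustrative, not the paper's).

    attn_maps: list of HxW cross-attention maps (nested lists) for one
               object token, taken at several decoder resolutions.
    box:       (x0, y0, x1, y1) target region in normalized [0, 1] coords.
    """
    total = 0.0
    for A in attn_maps:
        H, W = len(A), len(A[0])
        # Convert the normalized box to pixel indices at this resolution.
        x0, y0 = int(box[0] * W), int(box[1] * H)
        x1, y1 = int(box[2] * W), int(box[3] * H)
        mass_in = sum(A[y][x] for y in range(y0, y1) for x in range(x0, x1))
        mass_all = sum(sum(row) for row in A) + 1e-8
        # Localization: push attention mass inside the target region.
        loc = 1.0 - mass_in / mass_all
        # Existence: penalize objects whose mean attention falls below tau,
        # so no object is dropped from the image entirely.
        exist = max(0.0, tau - mass_all / (H * W))
        total += lam_loc * loc + lam_exist * exist
    # Average across decoder levels (multi-scale constraint).
    return total / len(attn_maps)
```

For example, an attention map concentrated inside the target box yields a near-zero loss, while the same mass placed outside the box yields a loss near 1, so gradient steps on the noise would pull attention toward the intended region.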