InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models often struggle to accurately encode spatial relationships specified in prompts, primarily due to the absence of fine-grained spatial supervision in training data and the limited capacity of text embeddings to represent spatial semantics. This paper proposes a training-free, inference-time spatial alignment method that dynamically recalibrates noise during denoising using multi-level cross-attention maps, jointly optimizing object localization accuracy and existence completeness. Key contributions include: (1) the first plug-and-play, fine-tuning-free, backbone-agnostic spatial alignment mechanism; and (2) a composite attention loss integrating multi-scale spatial constraints. Evaluated on VISOR and T2I-CompBench, the method achieves new state-of-the-art performance, significantly outperforming existing inference-time approaches and surpassing leading fine-tuning-based methods.

📝 Abstract
Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: the lack of fine-grained spatial supervision in training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free, inference-time method that improves spatial alignment by adjusting the noise through a compound loss at every denoising step. The proposed loss leverages cross-attention maps extracted at different levels of the backbone decoder to enforce accurate object placement and balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes, to the best of our knowledge, a new state of the art, achieving substantial gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
Problem

Research questions and friction points this paper is trying to address.

Improves spatial alignment in text-to-image diffusion models.
Addresses inaccurate object placement from text prompts.
Enhances spatial semantics without retraining the model.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free inference-time spatial alignment method
Adjusts noise via compound loss in denoising steps
Uses cross-attention maps for object placement and presence
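As a rough illustration of the mechanism described in the bullets above, the sketch below computes a toy compound attention loss combining a placement term (attention mass outside each object's target region) and a presence term (penalizing weak overall activation). The function names, threshold `tau`, and weight `lam` are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def placement_loss(attn, box):
    """Fraction of a token's attention mass falling outside its target box.
    attn: (H, W) cross-attention map; box: (y0, y1, x0, x1) pixel indices."""
    y0, y1, x0, x1 = box
    total = attn.sum() + 1e-8
    inside = attn[y0:y1, x0:x1].sum()
    return 1.0 - inside / total

def compound_loss(attn_maps, boxes, tau=0.3, lam=0.5):
    """Toy compound loss averaged over multi-level attention maps.
    attn_maps: {token: list of (H, W) maps from different decoder levels}
    boxes:     {token: (y0, y1, x0, x1) in normalized [0, 1] coordinates}
    """
    loss, n = 0.0, 0
    for tok, maps in attn_maps.items():
        ny0, ny1, nx0, nx1 = boxes[tok]
        for a in maps:
            H, W = a.shape
            box = (int(ny0 * H), int(ny1 * H), int(nx0 * W), int(nx1 * W))
            loss += placement_loss(a, box)                # placement term
            loss += lam * max(0.0, tau - float(a.max()))  # presence term
            n += 1
    return loss / max(n, 1)
```

In the actual method, a loss of this kind would be differentiated with respect to the noisy latent (e.g. via autograd) and the latent nudged by a gradient step at each denoising iteration; the sketch only shows how the loss itself could be scored from attention maps.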
Authors

Sarah Rastegar (Universiteit van Amsterdam): Causal Inference, Generalized Category Discovery
Violeta Chatalbasheva (Delft University of Technology, The Netherlands)
Sieger Falkena (Shell Information Technology International)
Anuj Singh (Delft University of Technology, The Netherlands)
Yanbo Wang (Delft University of Technology, The Netherlands)
Tejas Gokhale (Assistant Professor, University of Maryland Baltimore County): Cognitive Vision, Visual Reasoning, Concept Learning, Adversarial Training, Robustness
Hamid Palangi (Google and University of Washington): Artificial Intelligence, Machine Learning, Natural Language Processing
Hadi Jamali-Rad (Delft University of Technology, The Netherlands; Shell Information Technology International)