ESPLoRA: Enhanced Spatial Precision with Low-Rank Adaption in Text-to-Image Diffusion Models for High-Definition Synthesis

2025-04-18
AI Summary
Text-to-image (T2I) diffusion models struggle to render the spatial relations described in prompts (e.g., "left/right", "in front of/behind"). Existing approaches rely on external conditioning networks or predefined layouts, which raises computational cost and reduces flexibility. To address this, we propose ESPLoRA, a lightweight fine-tuning framework based on LoRA that requires no auxiliary networks or layout inputs. Our contributions are threefold: (1) a curated dataset of spatially explicit prompts, extracted and synthesized from LAION-400M; (2) evaluation metrics grounded in geometric constraints, which capture 3D spatial relations and expose spatial biases in T2I models; and (3) the TORE algorithm, which exploits those biases to further improve spatial consistency. ESPLoRA outperforms the state-of-the-art CoMPaSS by 13.33% on established spatial consistency benchmarks, without increasing generation time or compromising image quality.

Abstract
Diffusion models have revolutionized text-to-image (T2I) synthesis, producing high-quality, photorealistic images. However, they still struggle to properly render the spatial relationships described in text prompts. To address the lack of spatial information in T2I generations, existing methods typically use external network conditioning and predefined layouts, resulting in higher computational costs and reduced flexibility. Our approach builds upon a curated dataset of spatially explicit prompts, meticulously extracted and synthesized from LAION-400M to ensure precise alignment between textual descriptions and spatial layouts. Alongside this dataset, we present ESPLoRA, a flexible fine-tuning framework based on Low-Rank Adaptation, specifically designed to enhance spatial consistency in generative models without increasing generation time or compromising the quality of the outputs. In addition to ESPLoRA, we propose refined evaluation metrics grounded in geometric constraints, capturing 3D spatial relations such as "in front of" or "behind". These metrics also expose spatial biases in T2I models which, even when not fully mitigated, can be strategically exploited by our TORE algorithm to further improve the spatial consistency of generated images. Our method outperforms the current state-of-the-art framework, CoMPaSS, by 13.33% on established spatial consistency benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Improves spatial accuracy in text-to-image diffusion models
Avoids the computational overhead of external conditioning networks and predefined layouts without sacrificing output quality
Exposes spatial biases in T2I models through new geometry-based evaluation metrics
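
To make the idea of a geometry-based spatial check concrete, here is a minimal sketch in Python. It is not the paper's metric, whose exact definition is not given on this page. It assumes a hypothetical upstream step has already produced, for each object, a bounding-box center and a mean depth estimate (smaller depth meaning closer to the camera). The names `DetectedObject`, `relation_holds`, and `spatial_accuracy` are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    """Hypothetical detector output: a 2D box center plus a mean depth estimate."""
    cx: float     # horizontal center of the bounding box, in pixels
    cy: float     # vertical center, in pixels (image y grows downward)
    depth: float  # assumed mean distance from the camera (smaller = closer)


def relation_holds(a, b, relation, margin=0.0):
    """Return True if `a <relation> b` is satisfied by the two detections.

    `margin` is an assumed tolerance, in the units of the compared axis.
    """
    if relation == "left of":
        return a.cx < b.cx - margin
    if relation == "right of":
        return a.cx > b.cx + margin
    if relation == "above":
        return a.cy < b.cy - margin
    if relation == "below":
        return a.cy > b.cy + margin
    if relation == "in front of":
        return a.depth < b.depth - margin
    if relation == "behind":
        return a.depth > b.depth + margin
    raise ValueError(f"unknown relation: {relation!r}")


def spatial_accuracy(samples):
    """Fraction of (a, b, relation) triples whose relation is satisfied."""
    if not samples:
        return 0.0
    hits = sum(relation_holds(a, b, rel) for a, b, rel in samples)
    return hits / len(samples)
```

For example, with `cat = DetectedObject(100, 200, 2.0)` and `dog = DetectedObject(300, 210, 5.0)`, the checks for "left of" and "in front of" pass while "behind" fails. Relations like "in front of" cannot be decided from 2D boxes alone, which is why this sketch assumes a depth value per object.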
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Low-Rank Adaptation for spatial precision
Introduces curated dataset for spatial alignment
Proposes TORE algorithm to exploit spatial biases
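
As background for the first point, the standard Low-Rank Adaptation update can be sketched in a few lines of plain Python. This illustrates generic LoRA, not ESPLoRA's training code: the frozen weight `W` is augmented with a low-rank product `B @ A` scaled by `alpha / r`. The helper names `matmul` and `lora_effective_weight` are illustrative, and whether ESPLoRA merges the update into the base weights at inference is an assumption here.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the standard LoRA-adapted weight.

    w: frozen pretrained weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r); zero-initialized in LoRA
    """
    r = len(a)                 # rank of the adapter
    scale = alpha / r          # standard LoRA scaling factor
    delta = matmul(b, a)       # low-rank update, shape (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]
```

Because the update has the same shape as `W`, it can be folded into the base weight once after training, so a merged model runs with the same number of operations as the original. This is the usual reason LoRA-based methods add no generation time.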