AI Summary
This work addresses the persistent challenge of accurately modeling complex spatial relationships in text-to-image generation, where existing models often require multiple sampling attempts to produce plausible outputs. To tackle this limitation, the authors introduce SpatialReward-Dataset, a large-scale dataset of over 80,000 human-annotated preference pairs focused on spatial configurations, and use it to train SpatialScore, a specialized reward model that evaluates the fidelity of spatial relations in generated images. The approach integrates SpatialScore into an online reinforcement learning framework to directly optimize the generative process. Experiments show that this method consistently and significantly improves spatial reasoning and generation accuracy across multiple benchmarks, outperforming leading closed-source models in both qualitative and quantitative evaluations.
Abstract
Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also raised the bar for prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset, comprising over 80k preference pairs. Building on this dataset, we train SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation; it surpasses even leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in the spatial understanding of image generation models.
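The pipeline the abstract describes, scoring sampled images with a reward model and feeding those scores into an online reinforcement learning update, can be sketched in miniature as follows. This is an illustrative assumption of how such a loop is commonly structured (best-of-N reranking as the sampling baseline, mean-centered advantages for a REINFORCE-style update), not the paper's actual implementation; all function names and scores here are hypothetical:

```python
# Hypothetical sketch: using a spatial reward model's scores, first for
# best-of-N reranking (the "multiple sampling attempts" baseline), then as
# advantages for a simple policy-gradient update in online RL fine-tuning.

def best_of_n(candidates, score_fn):
    """Reranking baseline: among N sampled images, keep the one the
    reward model scores highest for spatial fidelity."""
    return max(candidates, key=score_fn)

def centered_advantages(rewards):
    """Turn raw reward-model scores for a batch of samples into zero-mean
    advantages; samples scoring above the batch mean get positive weight
    in a REINFORCE-style gradient update."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Toy scores standing in for a SpatialScore-like reward model's outputs:
scores = [0.2, 0.9, 0.5]
advantages = centered_advantages(scores)
```

The design point is that the same scalar reward serves both regimes: reranking uses it at inference time only, while the RL loop uses it to update the generator directly, removing the need for repeated sampling.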