AI Summary
This work addresses the challenges of semantic distortion and structural inconsistency in generating images from complex textual prompts involving multiple objects with specified attributes, quantities, and spatial relationships. The authors propose a scene graph-based, zero-shot soft visual guidance mechanism that leverages a lightweight language model during inference to produce conditional signals that steer a diffusion model. The key innovation lies in the ASQL Conditioner module, which enables the first unified zero-shot conditioning framework jointly modeling Attribute, Size, Quantity, and Location. This approach significantly enhances semantic fidelity and structural coherence in generated images under complex prompts while preserving output diversity and computational efficiency.
Abstract
Accurate interpretation and visual representation of complex prompts involving multiple objects, attributes, and spatial relationships is a critical challenge in text-to-image synthesis. Despite recent advancements in generating photorealistic outputs, current models often struggle to maintain semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis within the framework of scene graph structures, aiming to enhance the compositional abilities of existing models. Although prior approaches have attempted to address this by using pre-defined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the core of our method is the Attribute-Size-Quantity-Location (ASQL) Conditioner, which produces visual conditions via a lightweight language model and guides diffusion-based generation through inference-time optimization. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.
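For illustration only, a scene graph carrying the four ASQL fields might be represented as in the sketch below. The abstract does not specify the data format, so the class names (`SceneObject`, `SceneGraph`), the coarse string labels for size and location, and the `describe` helper are all hypothetical. The sketch shows only how a prompt could be structured, not how the ASQL Conditioner or the inference-time optimization works.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SceneObject:
    """One scene-graph node with the four ASQL fields (hypothetical layout)."""
    name: str
    attributes: List[str] = field(default_factory=list)  # e.g. ["red"]
    size: Optional[str] = None       # assumed coarse label, e.g. "small"
    quantity: int = 1
    location: Optional[str] = None   # assumed coarse label, e.g. "left"


@dataclass
class SceneGraph:
    objects: List[SceneObject]
    # Each relation is (subject index, predicate, object index).
    relations: List[Tuple[int, str, int]] = field(default_factory=list)

    def describe(self) -> List[str]:
        """Render every node and edge as a short phrase."""
        lines = []
        for obj in self.objects:
            parts = [str(obj.quantity)]
            if obj.size:
                parts.append(obj.size)
            parts.extend(obj.attributes)
            parts.append(obj.name)
            phrase = " ".join(parts)
            if obj.location:
                phrase += f" @ {obj.location}"
            lines.append(phrase)
        for subj, predicate, target in self.relations:
            lines.append(
                f"{self.objects[subj].name} {predicate} {self.objects[target].name}"
            )
        return lines


# Hypothetical graph for the prompt "two small red apples on a wooden table".
graph = SceneGraph(
    objects=[
        SceneObject("apple", attributes=["red"], size="small",
                    quantity=2, location="left"),
        SceneObject("table", attributes=["wooden"]),
    ],
    relations=[(0, "on", 1)],
)
print(graph.describe())
```

Keeping size and location as coarse labels rather than fixed bounding boxes is one way to stay with soft guidance instead of a rigid layout map, though the encoding the authors actually use may differ.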