🤖 AI Summary
Current text-to-image models frequently suffer from imprecise prompt alignment, resulting in missing key elements or semantic confusion. To address this, we propose Signal-Aligned Distribution Learning (SADL), a framework that explicitly models the semantic signal component during diffusion and flow-matching denoising processes—enabling fine-grained, architecture-agnostic prompt control without additional training. SADL supports multimodal conditioning inputs—including text and bounding boxes—while avoiding over-optimization and out-of-distribution artifacts. Evaluated on MSCOCO and RefCOCO benchmarks, our method significantly improves semantic fidelity and spatial structural consistency of generated images, achieving state-of-the-art performance across all major metrics. The implementation is publicly available.
📝 Abstract
State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities -- such as bounding boxes -- for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at https://github.com/grimalPaul/gsn-factory.