SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation

📅 2025-08-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image models frequently suffer from imprecise prompt alignment, producing images that omit key elements or confuse distinct concepts. To address this, the paper proposes Signal-Aligned Distribution Learning (SADL), a framework that explicitly models the signal component during diffusion and flow-matching denoising, enabling fine-grained, architecture-agnostic prompt control without additional training. SADL supports multimodal conditioning inputs, including text and bounding boxes, while avoiding over-optimization and out-of-distribution artifacts. Evaluated on the MSCOCO and RefCOCO benchmarks, the method significantly improves the semantic fidelity and spatial consistency of generated images, achieving state-of-the-art performance across all major metrics. The implementation is publicly available.

📝 Abstract
State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities -- such as bounding boxes -- for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at https://github.com/grimalPaul/gsn-factory.
Problem

Research questions and friction points this paper is trying to address.

Addresses misalignment between text prompts and generated images
Mitigates missing elements and unintended concept blending
Enhances spatial alignment with additional conditioning modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning signal-aligned distributions for text-to-image
Explicitly modeling signal component during denoising
Training-free integration with diffusion and flow architectures
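The paper's exact procedure is not reproduced here, but the core idea of "explicitly modeling the signal component during denoising" can be illustrated with the standard diffusion identity: at any noise level, the clean-signal estimate can be recovered from the noisy latent and the model's noise prediction. A minimal numpy sketch, assuming a DDPM-style forward process with cumulative signal fraction `alpha_bar_t` (all names here are illustrative, not from the paper's code):

```python
import numpy as np

def signal_estimate(x_t, eps_hat, alpha_bar_t):
    """Estimate the clean signal x0 from a noisy latent x_t and a predicted
    noise eps_hat, using x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t).
    Training-free guidance methods can then act on x0_hat at every denoising
    step instead of only on x_t."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)

# Toy check: build x_t from a known x0; if the predicted noise equals the
# true noise, the identity recovers x0 exactly.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 4))
noise = rng.normal(size=(4, 4))
abar = 0.7
x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * noise
assert np.allclose(signal_estimate(x_t, noise, abar), x0)
```

Because this identity holds at every step of the reverse process for both diffusion and (with the appropriate interpolation convention) flow-matching samplers, conditioning applied to the signal estimate requires no retraining of the base model, which is consistent with the training-free, architecture-agnostic claim above.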