SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation

📅 2025-08-19

📈 Citations: 0

✨ Influential: 0

career value

176K/year

🤖 AI Summary

Current text-to-image models frequently suffer from imprecise prompt alignment, resulting in missing key elements or semantic confusion. To address this, we propose Signal-Aligned Distribution Learning (SADL), a framework that explicitly models the semantic signal component during diffusion and flow-matching denoising processes—enabling fine-grained, architecture-agnostic prompt control without additional training. SADL supports multimodal conditioning inputs—including text and bounding boxes—while avoiding over-optimization and out-of-distribution artifacts. Evaluated on MSCOCO and RefCOCO benchmarks, our method significantly improves semantic fidelity and spatial structural consistency of generated images, achieving state-of-the-art performance across all major metrics. The implementation is publicly available.

Technology Category

Application Category

📝 Abstract

State-of-the-art text-to-image models produce visually impressive results but often struggle with precise alignment to text prompts, leading to missing critical elements or unintended blending of distinct concepts. We propose a novel approach that learns a high-success-rate distribution conditioned on a target prompt, ensuring that generated images faithfully reflect the corresponding prompts. Our method explicitly models the signal component during the denoising process, offering fine-grained control that mitigates over-optimization and out-of-distribution artifacts. Moreover, our framework is training-free and seamlessly integrates with both existing diffusion and flow matching architectures. It also supports additional conditioning modalities -- such as bounding boxes -- for enhanced spatial alignment. Extensive experiments demonstrate that our approach outperforms current state-of-the-art methods. The code is available at https://github.com/grimalPaul/gsn-factory.

Problem

Research questions and friction points this paper is trying to address.

Addresses misalignment between text prompts and generated images

Mitigates missing elements and unintended concept blending

Enhances spatial alignment with additional conditioning modalities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning signal-aligned distributions for text-to-image

Explicitly modeling signal component during denoising

Training-free integration with diffusion and flow architectures

🔎 Similar Papers

No similar papers found.