Self-Corrected Image Generation with Explainable Latent Rewards

πŸ“… 2026-03-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge of aligning fine-grained semantics and spatial relationships in text-to-image generation, particularly for complex prompts. The authors propose xLARD, a framework that uses a multimodal large language model to produce interpretable latent reward signals and a lightweight corrector that iteratively refines generation within the diffusion model's latent space. The key innovation is a differentiable mapping that transforms non-differentiable image-level evaluations into continuous latent rewards, enabling an interpretable and intervenable self-assessment and self-correction mechanism. Experiments show that the method significantly improves semantic alignment and visual fidelity across diverse generation and editing tasks while preserving the original generative priors.

πŸ“ Abstract
Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.
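The core mechanism described in the abstract can be illustrated with a toy sketch: a non-differentiable image-level score (standing in for the MLLM evaluation) is replaced by a differentiable latent-level surrogate reward, and a lightweight corrector refines the latent by ascending that reward. Everything below is hypothetical illustration, not the paper's actual implementation: `image_level_score`, `latent_reward`, and `correct` are made-up stand-ins, and the "reference latent" plays the role of the model-generated reference feedback.

```python
import numpy as np

def image_level_score(latent, reference):
    # Stand-in for a non-differentiable MLLM evaluation:
    # a discrete 0-10 alignment score (not usable for gradients).
    return float(np.clip(10 - np.linalg.norm(latent - reference), 0, 10) // 1)

def latent_reward(latent, reference):
    # Differentiable surrogate: a continuous reward over latents,
    # standing in for xLARD's mapping from image-level feedback.
    return -np.sum((latent - reference) ** 2)

def reward_grad(latent, reference):
    # Analytic gradient of the surrogate reward w.r.t. the latent.
    return -2.0 * (latent - reference)

def correct(latent, reference, steps=50, lr=0.1):
    # Lightweight corrector: iteratively refine the latent by
    # gradient ascent on the continuous latent reward.
    z = latent.copy()
    for _ in range(steps):
        z = z + lr * reward_grad(z, reference)
    return z

rng = np.random.default_rng(0)
reference = rng.normal(size=8)   # reference latent derived from feedback
z0 = rng.normal(size=8)          # initial generation latent
z1 = correct(z0, reference)      # refined latent after self-correction
```

After correction, the refined latent scores at least as well under both the continuous surrogate and the discrete image-level evaluation, which is the asymmetry the paper exploits: evaluation is tractable even when feed-forward generation is not.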
Problem

Research questions and friction points this paper is trying to address.

text-to-image generation
semantic alignment
spatial relations
fine-grained semantics
prompt alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-correcting generation
explainable latent rewards
multimodal large language models
latent-space refinement
differentiable reward mapping