Self-Corrected Image Generation with Explainable Latent Rewards

πŸ“… 2026-03-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the challenge of aligning fine-grained semantics and spatial relationships in text-to-image generation, particularly for complex prompts. The authors propose xLARD, a framework that uses a multimodal large language model to produce interpretable latent reward signals and a lightweight corrector that iteratively refines generation within the diffusion model's latent space. The key innovation is a differentiable mapping that transforms non-differentiable image-level evaluations into continuous latent rewards, enabling an interpretable and intervenable self-assessment and self-correction mechanism. Experiments show that the method significantly improves semantic alignment and visual fidelity across diverse generation and editing tasks while preserving the original generative priors.

πŸ“ Abstract
Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.
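The core mechanism described in the abstract can be illustrated with a toy sketch: a non-differentiable image-level score (standing in for the MLLM evaluation) is replaced by a differentiable latent-level surrogate reward, and a lightweight corrector refines the latent by ascending that reward. Everything below is hypothetical illustration, not the paper's actual implementation: `image_level_score`, `latent_reward`, and `correct` are made-up stand-ins, and the "reference latent" plays the role of the model-generated reference feedback.

```python
import numpy as np

def image_level_score(latent, reference):
    # Stand-in for a non-differentiable MLLM evaluation:
    # a discrete 0-10 alignment score (not usable for gradients).
    return float(np.clip(10 - np.linalg.norm(latent - reference), 0, 10) // 1)

def latent_reward(latent, reference):
    # Differentiable surrogate: a continuous reward over latents,
    # standing in for xLARD's mapping from image-level feedback.
    return -np.sum((latent - reference) ** 2)

def reward_grad(latent, reference):
    # Analytic gradient of the surrogate reward w.r.t. the latent.
    return -2.0 * (latent - reference)

def correct(latent, reference, steps=50, lr=0.1):
    # Lightweight corrector: iteratively refine the latent by
    # gradient ascent on the continuous latent reward.
    z = latent.copy()
    for _ in range(steps):
        z = z + lr * reward_grad(z, reference)
    return z

rng = np.random.default_rng(0)
reference = rng.normal(size=8)   # reference latent derived from feedback
z0 = rng.normal(size=8)          # initial generation latent
z1 = correct(z0, reference)      # refined latent after self-correction
```

After correction, the refined latent scores at least as well under both the continuous surrogate and the discrete image-level evaluation, which is the asymmetry the paper exploits: evaluation is tractable even when feed-forward generation is not.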
Problem

Research questions and friction points this paper is trying to address.

text-to-image generation
semantic alignment
spatial relations
fine-grained semantics
prompt alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-correcting generation
explainable latent rewards
multimodal large language models
latent-space refinement
differentiable reward mapping