AI Summary
This work addresses the challenge in in-context image generation and editing (ICGE) that existing models struggle to faithfully translate interleaved image-text prompts into accurate outputs. To this end, the authors propose Re-Align, a unified framework that introduces an In-Context Chain-of-Thought (IC-CoT) reasoning paradigm to decouple semantic guidance from reference alignment. By integrating a proxy reward-based reinforcement learning mechanism, Re-Align achieves structured alignment between comprehension and generation processes. Extensive experiments demonstrate that the method significantly outperforms state-of-the-art models of comparable scale across multiple ICGE benchmarks, demonstrating its effectiveness in improving fidelity to user intent.
Abstract
In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between the structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.