AI Summary
This work addresses the challenge in in-context image generation and editing (ICGE) that existing models struggle to faithfully translate interleaved image-text prompts into accurate outputs. To this end, the authors propose Re-Align, a unified framework that introduces an In-Context Chain-of-Thought (IC-CoT) reasoning paradigm to decouple semantic guidance from reference alignment. By integrating a proxy reward-based reinforcement learning mechanism, Re-Align achieves structured alignment between comprehension and generation processes. Extensive experiments demonstrate that the method significantly outperforms state-of-the-art models of comparable scale across multiple ICGE benchmarks, demonstrating its effectiveness in improving fidelity to user intent.
Abstract
In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between the structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.