🤖 AI Summary
Current unified multimodal generation models over-rely on textual prompts while neglecting consistency with visual reference images, leading to poor preservation of critical visual attributes such as identity, object properties, and stylistic coherence. To address this, we propose the first visual-context-consistency-oriented chain-of-thought framework, explicitly embedding visual consistency into the generation process via adaptive visual planning and iterative visual correction. Our method integrates supervised fine-tuning, flow-based Group Relative Policy Optimization (Flow-GRPO), and a customized visual verification reward, enabling visual checklist generation, self-reflection, and progressive refinement. Experiments demonstrate substantial improvements over zero-shot unified baselines and text-only chain-of-thought approaches on multi-reference image generation tasks, with significant gains in quantitative visual consistency metrics. This work establishes a novel paradigm for controllable multimodal generation grounded in explicit visual reasoning and consistency enforcement.
📝 Abstract
Recently, the introduction of Chain-of-Thought (CoT) reasoning has substantially improved the generation ability of unified models. However, we observe that the current thinking process during generation focuses mainly on consistency with the text prompt, ignoring **visual context consistency** with the visual reference images in multi-modal generation, e.g., multi-reference generation. The lack of such consistency causes failures in preserving key visual features (such as human identity, object attributes, and style). To this end, we integrate visual context consistency into the reasoning of unified models, explicitly encouraging the model to maintain such consistency through 1) Adaptive Visual Planning: generating a structured visual checklist that identifies which visual elements must remain consistent, and 2) Iterative Visual Correction: performing self-reflection guided by the checklist and refining the generated result iteratively. To achieve this, we use supervised fine-tuning to teach the model to plan the visual checks, conduct self-reflection, and perform self-refinement, and we apply Flow-GRPO to further enhance visual consistency through a customized visual-checking reward. Experiments show that our method outperforms both zero-shot unified models and those with text-only CoTs in multi-modal generation, demonstrating higher visual context consistency.
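The plan-reflect-refine loop described above can be sketched in pseudocode-style Python. This is a minimal illustration, not the paper's implementation: all model calls (`plan_checklist`, `generate`, `reflect`, `refine`) are hypothetical stand-in stubs, and the checklist format and stopping criterion are assumptions.

```python
# Hypothetical sketch of the Adaptive Visual Planning + Iterative Visual
# Correction loop from the abstract. Every function below is a stand-in
# stub for a call to the unified model; names and formats are assumed.

def plan_checklist(prompt, reference_images):
    """Adaptive Visual Planning: list visual elements that must stay
    consistent with the reference images (stub)."""
    return [f"match:{ref}" for ref in reference_images]

def generate(prompt, reference_images):
    """Stand-in for the unified model's initial image generation."""
    return {"prompt": prompt, "kept": set()}

def reflect(image, checklist):
    """Self-reflection: return checklist items the image still violates (stub)."""
    return [item for item in checklist if item not in image["kept"]]

def refine(image, violations):
    """Iterative Visual Correction: repair one violated item per step (stub)."""
    image["kept"].add(violations[0])
    return image

def generate_with_visual_cot(prompt, reference_images, max_iters=5):
    # 1) Plan which visual elements need consistency checking.
    checklist = plan_checklist(prompt, reference_images)
    # 2) Generate, then iteratively self-reflect and correct.
    image = generate(prompt, reference_images)
    for _ in range(max_iters):
        violations = reflect(image, checklist)
        if not violations:
            break
        image = refine(image, violations)
    return image, checklist

out, checklist = generate_with_visual_cot("a photo", ["id_face", "style_ref"])
print(sorted(out["kept"]))
```

In the paper, supervised fine-tuning would teach the model to emit the checklist and reflections itself, and Flow-GRPO would optimize against a visual-checking reward; here the loop only illustrates the control flow.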