🤖 AI Summary
Existing refinement methods for unified multimodal models rely on editing instructions that describe text-image inconsistencies only coarsely, and they are constrained by strict pixel-level content preservation; both factors limit semantic alignment. This work proposes Refinement via Regeneration (RvR), a framework that reframes refinement as conditional image regeneration. RvR guides generation jointly with the target prompt and the semantic tokens extracted from the initial image, dispensing with editing instructions and rigid content-preservation constraints and thereby enlarging the space of feasible modifications. Built on a unified multimodal model and integrating semantic token extraction, conditional regeneration, and end-to-end fine-tuning, the method reaches 0.91 on Geneval, 87.21 on DPGBench, and 77.41 on UniGenBench++ (up from 0.78, 84.02, and 61.53, respectively).
📝 Abstract
Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, where UMMs produce editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment with a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, improving Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.
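The control flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class ToyUnifiedModel, its methods, and the function refine_via_regeneration are hypothetical names, and an "image" is modeled as a plain list of concept strings so the example runs without any model weights. It only shows the idea that regeneration is conditioned on both the target prompt and the semantic tokens of the initial image, with no editing instruction and no pixel-level preservation step.

```python
from typing import List, Sequence

# Toy stand-in (assumption): an "image" is just the list of concepts it depicts.
Image = List[str]


class ToyUnifiedModel:
    """Hypothetical stub for a unified multimodal model; not a real API."""

    def generate(self, prompt: str) -> Image:
        # Simulate an imperfect first pass: the last prompt concept is
        # replaced by something the prompt never asked for.
        concepts = prompt.split(", ")
        return concepts[:-1] + ["unrelated object"]

    def semantic_tokens(self, image: Image) -> List[str]:
        # Assumption: semantic tokens summarize what the image depicts.
        return list(image)

    def regenerate(self, prompt: str, tokens: Sequence[str]) -> Image:
        # Conditional regeneration: keep the semantics of the initial image
        # that agree with the prompt, then add whatever the prompt still needs.
        wanted = prompt.split(", ")
        kept = [t for t in tokens if t in wanted]
        added = [c for c in wanted if c not in kept]
        return kept + added


def refine_via_regeneration(model: ToyUnifiedModel, prompt: str,
                            initial_image: Image) -> Image:
    """Refine by regenerating from (prompt, semantic tokens of the first image)."""
    tokens = model.semantic_tokens(initial_image)
    return model.regenerate(prompt, tokens)


model = ToyUnifiedModel()
prompt = "a red cat, a wooden table, a green lamp"
first = model.generate(prompt)
refined = refine_via_regeneration(model, prompt, first)
print("initial:", first)
print("refined:", refined)
```

In this toy setting the refined output is free to differ from the first image wherever the prompt requires it, which is the larger modification space the abstract refers to. A refinement-via-editing variant would instead have to emit an explicit instruction and leave everything else untouched.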