🤖 AI Summary
Unified multimodal models (UMMs) commonly exhibit a significant gap between strong visual understanding and weak generative capability. To address this, we propose Self-Rewarding Unified Multimodal learning (SRUM), the first framework that leverages the model's own understanding module as a parameter-free internal evaluator, establishing a global-local dual-level feedback mechanism for fine-grained self-supervised optimization of the generation module. SRUM integrates multi-scale semantic alignment with layout consistency constraints, enabling annotation-free post-training across diverse UMM architectures. On T2I-CompBench and T2I-ReasonBench, SRUM achieves gains of +6.19 and +2.93, respectively, substantially improving image generation fidelity and cross-modal reasoning ability. Our approach establishes a scalable, closed-loop paradigm for the autonomous evolution of UMMs.
📝 Abstract
Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding within a single framework. However, a significant gap remains: a model's strong visual understanding often fails to transfer to its visual generation. A model may correctly understand an image according to user instructions, yet be unable to generate a faithful image from text prompts. This phenomenon raises a compelling question: can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap, we introduce SRUM, a self-rewarding post-training framework that can be applied directly to existing UMMs of various designs. SRUM creates a feedback loop in which the model's own understanding module acts as an internal "evaluator", providing corrective signals to improve its generation module without requiring additional human-labeled data. To make this feedback comprehensive, we design a global-local dual reward system. To tackle the inherent structural complexity of images, this system offers multi-scale guidance: a global reward ensures the correctness of the overall visual semantics and layout, while a local reward refines fine-grained, object-level fidelity. SRUM yields powerful capabilities and strong generalization, boosting performance on T2I-CompBench from 82.18 to 88.37 and on T2I-ReasonBench from 43.82 to 46.75. Overall, our work establishes a powerful new paradigm in which a UMM's understanding module guides and enhances its own generation via self-rewarding.
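The global-local dual reward described above can be pictured as a simple scoring rule: the understanding module produces one score for overall semantics and layout, and one score per object region, which are then mixed into a single training signal for the generator. The sketch below is purely illustrative (all names, and the linear mixing weight `alpha`, are our assumptions, not the authors' actual implementation):

```python
def combined_reward(global_score, local_scores, alpha=0.5):
    """Hypothetical mix of SRUM-style rewards.

    global_score: understanding module's score for overall
        semantics and layout of the generated image (in [0, 1]).
    local_scores: per-object fidelity scores (each in [0, 1]).
    alpha: assumed weight trading off global vs. local feedback.
    """
    # Average the object-level scores into one local signal.
    local_mean = sum(local_scores) / len(local_scores)
    # Linear combination; the real framework may use a different rule.
    return alpha * global_score + (1 - alpha) * local_mean
```

In a self-rewarding loop, this scalar would then supervise the generation module (e.g., as a reward in post-training), closing the feedback loop without human labels.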