🤖 AI Summary
Although Unified Multimodal Models exhibit strong comprehension capabilities, that understanding often fails to guide generation, resulting in a "Cognitive Gap." This work proposes an endogenous reprompting mechanism that dynamically generates self-aligned descriptors during generation, explicitly converting implicit understanding into structured reasoning steps. To bridge cognition and generation, we introduce the SEER training framework, which establishes a closed-loop alignment between the two processes. Notably, our approach relies on a compact proxy task requiring only 300 samples to activate the model's self-evaluation and optimization abilities. Furthermore, we design a two-stage reinforcement learning strategy that combines Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning with Model-rewarded Thinking (RLMT) to jointly refine the evaluation and generation policies. Experiments demonstrate that our method surpasses state-of-the-art approaches in evaluation accuracy, reprompting efficiency, and generation quality, while preserving general multimodal capabilities without degradation.
📝 Abstract
Unified Multimodal Models (UMMs) exhibit strong understanding, yet this capability often fails to effectively guide generation. We identify this as a Cognitive Gap: the model does not understand how to improve its own generation process. To bridge this gap, we propose Endogenous Reprompting, a mechanism that transforms the model's understanding from a passive encoding process into an explicit generative reasoning step by generating self-aligned descriptors during generation. To achieve this, we introduce SEER (Self-Evolving Evaluator and Reprompter), a training framework that establishes a two-stage endogenous loop using only 300 samples from a compact proxy task, Visual Instruction Elaboration. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model's latent evaluation ability via curriculum learning, producing a high-fidelity endogenous reward signal. Second, Reinforcement Learning with Model-rewarded Thinking (RLMT) leverages this signal to optimize the generative reasoning policy. Experiments show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality, without sacrificing general multimodal capabilities.
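As a rough, hypothetical sketch of the two-stage idea (first calibrate an evaluator against a verifiable reward, then reuse that evaluator as the reward for reprompting), the toy Python below is illustrative only: the attribute-matching reward, the word-correlation "evaluator," and the greedy reprompt step are stand-ins chosen for brevity, not SEER's actual training procedure.

```python
from statistics import mean

# Toy "verifiable" reward (RLVR analogue): how many required visual
# attributes a candidate reprompt mentions. All names here are invented
# for illustration and do not come from the paper.
REQUIRED = {"red", "cube", "left"}

def verifiable_reward(caption: str) -> float:
    return len(REQUIRED & set(caption.split())) / len(REQUIRED)

# Stage 1: fit a crude endogenous evaluator from a tiny sample by
# keeping words that correlate with a high verifiable reward.
def train_evaluator(samples):
    vocab = {w for c in samples for w in c.split()}
    important = set()
    for word in vocab:
        with_w = [verifiable_reward(c) for c in samples if word in c.split()]
        without = [verifiable_reward(c) for c in samples if word not in c.split()]
        if with_w and without and mean(with_w) > mean(without):
            important.add(word)
    # The returned evaluator scores captions without consulting ground truth.
    return lambda caption: len(important & set(caption.split())) / max(len(important), 1)

# Stage 2 (RLMT analogue): the frozen evaluator supplies the reward used
# to choose among candidate reprompts (a one-step greedy "policy").
def reprompt(candidates, evaluator):
    return max(candidates, key=evaluator)

evaluator = train_evaluator(["a red cube on the left", "a dog"])
print(reprompt(["a dog", "a red cube", "a red cube on the left side"], evaluator))
```

With only two training captions the learned evaluator is deliberately crude (it also keeps filler words like "on" and "the"), which mirrors why the real framework needs reinforcement learning rather than a one-shot fit.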