🤖 AI Summary
Although Unified Multimodal Models exhibit strong comprehension capabilities, that understanding often fails to guide generation, resulting in a "Cognitive Gap." This work proposes an endogenous reprompting mechanism that dynamically generates self-aligned descriptors during generation, explicitly converting implicit understanding into structured reasoning steps. To bridge cognition and generation, we introduce the SEER training framework, which establishes a closed-loop alignment between the two processes. Notably, our approach relies on a compact proxy task requiring only 300 samples to activate the model's self-evaluation and optimization abilities. Furthermore, we design a two-stage reinforcement learning strategy that combines Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning with Model-rewarded Thinking (RLMT) to jointly refine the evaluation and generation policies. Experiments demonstrate that our method surpasses state-of-the-art approaches in evaluation accuracy, reprompting efficiency, and generation quality, while preserving general multimodal capabilities without degradation.
📝 Abstract
Unified Multimodal Models (UMMs) exhibit strong understanding, yet this capability often fails to effectively guide generation. We identify this as a Cognitive Gap: the model does not understand how to improve its own generation process. To bridge this gap, we propose Endogenous Reprompting, a mechanism that transforms the model's understanding from a passive encoding process into an explicit generative reasoning step by generating self-aligned descriptors during generation. To achieve this, we introduce SEER (Self-Evolving Evaluator and Reprompter), a training framework that establishes a two-stage endogenous loop using only 300 samples from a compact proxy task, Visual Instruction Elaboration. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model's latent evaluation ability via curriculum learning, producing a high-fidelity endogenous reward signal. Second, Reinforcement Learning with Model-rewarded Thinking (RLMT) leverages this signal to optimize the generative reasoning policy. Experiments show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality, without sacrificing general multimodal capabilities.
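As a rough, hypothetical sketch of the two-stage idea (first calibrate an evaluator against a verifiable reward, then reuse that evaluator as the reward for reprompting), the toy Python below is illustrative only: the attribute-matching reward, the word-correlation "evaluator," and the greedy reprompt step are stand-ins chosen for brevity, not SEER's actual training procedure.

```python
from statistics import mean

# Toy "verifiable" reward (RLVR analogue): how many required visual
# attributes a candidate reprompt mentions. All names here are invented
# for illustration and do not come from the paper.
REQUIRED = {"red", "cube", "left"}

def verifiable_reward(caption: str) -> float:
    return len(REQUIRED & set(caption.split())) / len(REQUIRED)

# Stage 1: fit a crude endogenous evaluator from a tiny sample by
# keeping words that correlate with a high verifiable reward.
def train_evaluator(samples):
    vocab = {w for c in samples for w in c.split()}
    important = set()
    for word in vocab:
        with_w = [verifiable_reward(c) for c in samples if word in c.split()]
        without = [verifiable_reward(c) for c in samples if word not in c.split()]
        if with_w and without and mean(with_w) > mean(without):
            important.add(word)
    # The returned evaluator scores captions without consulting ground truth.
    return lambda caption: len(important & set(caption.split())) / max(len(important), 1)

# Stage 2 (RLMT analogue): the frozen evaluator supplies the reward used
# to choose among candidate reprompts (a one-step greedy "policy").
def reprompt(candidates, evaluator):
    return max(candidates, key=evaluator)

evaluator = train_evaluator(["a red cube on the left", "a dog"])
print(reprompt(["a dog", "a red cube", "a red cube on the left side"], evaluator))
```

With only two training captions the learned evaluator is deliberately crude (it also keeps filler words like "on" and "the"), which mirrors why the real framework needs reinforcement learning rather than a one-shot fit.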