Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Although unified multimodal models exhibit strong comprehension capabilities, they struggle to translate that understanding into effective guidance for generation, resulting in a "cognition-generation gap." This work proposes an endogenous reprompting mechanism that dynamically generates self-aligned descriptors during generation, explicitly converting implicit understanding into structured reasoning steps. To bridge cognition and generation, the authors introduce the SEER training framework, which establishes a closed-loop alignment between the two processes. Notably, the approach relies on a compact proxy task requiring only 300 samples to activate the model's self-evaluation and optimization abilities. Furthermore, a two-stage reinforcement learning strategy, combining Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning with Model-rewarded Thinking (RLMT), jointly refines the evaluation and generation policies. Experiments demonstrate that the method surpasses state-of-the-art approaches in evaluation accuracy, reprompting efficiency, and generation quality, while preserving general multimodal capabilities without degradation.

📝 Abstract
Unified Multimodal Models (UMMs) exhibit strong understanding, yet this capability often fails to effectively guide generation. We identify this as a Cognitive Gap: the model lacks the understanding of how to enhance its own generation process. To bridge this gap, we propose Endogenous Reprompting, a mechanism that transforms the model's understanding from a passive encoding process into an explicit generative reasoning step by generating self-aligned descriptors during generation. To achieve this, we introduce SEER (Self-Evolving Evaluator and Reprompter), a training framework that establishes a two-stage endogenous loop using only 300 samples from a compact proxy task, Visual Instruction Elaboration. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model's latent evaluation ability via curriculum learning, producing a high-fidelity endogenous reward signal. Second, Reinforcement Learning with Model-rewarded Thinking (RLMT) leverages this signal to optimize the generative reasoning policy. Experiments show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality, without sacrificing general multimodal capabilities.
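The two-stage loop the abstract describes can be sketched in miniature: stage 1 (RLVR) calibrates an evaluator against verifiable labels from the proxy task, and stage 2 (RLMT) reuses that evaluator's score as the reward for choosing among reprompted generations. The toy `evaluate` function, the sample data, and the greedy candidate selection below are illustrative assumptions, not the authors' implementation:

```python
# Conceptual sketch of SEER's two-stage endogenous loop.
# The evaluator, samples, and update rule are toy assumptions for illustration.

def evaluate(prompt, output):
    """Toy evaluator: fraction of prompt words covered by the output (assumption)."""
    words = prompt.lower().split()
    return sum(w in output.lower() for w in words) / len(words)

def rlvr_stage(evaluator, proxy_samples, lr=0.5):
    """Stage 1 (RLVR): calibrate the evaluator against verifiable labels.
    Returns the calibrated scorer, usable as an endogenous reward signal."""
    bias = 0.0
    for prompt, output, label in proxy_samples:
        pred = evaluator(prompt, output) + bias
        reward = 1.0 if round(pred) == label else 0.0      # verifiable check
        bias += lr * (label - pred) * (1.0 - reward)       # nudge only on mistakes
    return lambda p, o: evaluator(p, o) + bias

def rlmt_stage(generate_variants, endogenous_reward, prompt):
    """Stage 2 (RLMT): use the evaluator's score as the reward signal
    (greedy selection here stands in for policy optimization)."""
    return max(generate_variants(prompt),
               key=lambda out: endogenous_reward(prompt, out))

# Tiny proxy task: (prompt, output, verifiable label in {0, 1}).
proxy = [("a red cube", "photo of a red cube", 1),
         ("a red cube", "a blue sphere", 0)]
reward_fn = rlvr_stage(evaluate, proxy)

def variants(prompt):
    # Hypothetical reprompted candidates for the same generation request.
    return ["a blue sphere", prompt + ", highly detailed, correct colors"]

best = rlmt_stage(variants, reward_fn, "a red cube")
```

Here `best` is the elaborated reprompt, since the calibrated evaluator scores it highest; in the paper, the same endogenous reward would instead drive RL updates to the generative reasoning policy.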
Problem

Research questions and friction points this paper is trying to address.

Cognitive Gap
Unified Multimodal Models
Generative Reasoning
Self-Evolving
Endogenous Reprompting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Endogenous Reprompting
Cognitive Alignment
Reinforcement Learning with Verifiable Rewards
Self-Evolving Evaluator and Reprompter
Unified Multimodal Models
Zhenchen Tang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Songlin Yang
MMLab@HKUST, The Hong Kong University of Science and Technology
Zichuan Wang
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Bo Peng
Institute of Automation, Chinese Academy of Sciences
Computer Vision · Image Forensics · Biometrics and Security
Yang Li
Institute of Automation, Chinese Academy of Sciences
MLLM · Agent · Brain-inspired Intelligence · Artificial Intelligence
Beibei Dong
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Jing Dong
New Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences