AI Summary
Multimodal large language models (MLLMs) suffer from a "generation-comprehension self-contradiction": generated images often mismatch the model's own interpretation of the input prompt, stemming primarily from weak generative capability rather than deficient comprehension. This paper proposes a self-contradiction-driven self-improvement framework that leverages the model's stronger comprehension ability as an internal supervisory signal to guide optimization of the generation branch. Methodologically, we design a Nonunified score to quantify self-contradiction and integrate curriculum learning to progressively fine-tune the generation branch via supervised fine-tuning (SFT) and direct preference optimization (DPO) during post-training. Key findings include: (i) optimizing the generation branch can synergistically enhance comprehension, but poor supervision risks co-degradation; (ii) intrinsic consistency metrics such as the Nonunified score cannot distinguish co-improvement from co-degradation. Experiments demonstrate significant improvements in generation-comprehension alignment, detection of false positives previously misidentified as prompt-aligned, and gains in both unification and generation quality.
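The self-contradiction measurement and the internal-supervision idea can be sketched in a few lines. This is a toy illustration only: `generate`, `judge_alignment`, the token-overlap heuristic, the threshold, and the exact form of the score are assumptions for demonstration, not the paper's API or definition; in the real framework both roles are played by the MLLM's own generation and understanding branches.

```python
# Toy sketch of self-contradiction scoring and DPO pair construction.
# All function names and heuristics here are illustrative stand-ins.

def judge_alignment(prompt: str, image: str) -> float:
    # Stand-in for the understanding branch: alignment score in [0, 1]
    # for a (prompt, image) pair, here approximated by token overlap.
    p, i = set(prompt.split()), set(image.split())
    return len(p & i) / max(len(p), 1)

def generate(prompt: str) -> str:
    # Stand-in for the generation branch; a real MLLM returns an image.
    return prompt.replace("red", "blue")  # simulate a misaligned generation

def nonunified_score(prompts, threshold=0.9):
    # Fraction of generations the model's own understanding flags as
    # prompt-misaligned: higher means more self-contradiction.
    flagged = [p for p in prompts if judge_alignment(p, generate(p)) < threshold]
    return len(flagged) / len(prompts), flagged

def build_dpo_pairs(prompt, candidates, margin=0.2):
    # Internal supervision: rank candidate generations with the understanding
    # branch and keep (chosen, rejected) pairs separated by a score margin.
    scored = sorted(((judge_alignment(prompt, c), c) for c in candidates), reverse=True)
    best_score, best = scored[0]
    return [(best, c) for s, c in scored[1:] if best_score - s >= margin]

prompts = ["a red car on a road", "a dog in a park"]
score, hard_cases = nonunified_score(prompts)          # score = 0.5 here
pairs = build_dpo_pairs("a red car", ["a red car", "a blue car"])
```

The flagged `hard_cases` are exactly the samples where the stronger understanding branch disagrees with the weaker generation branch, which is what makes them usable as training signal.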
Abstract
Despite efforts to unify multimodal generation and understanding in a single model, we show that these MLLMs exhibit self-contradiction: generation produces images that the model's own understanding deems misaligned with the input prompts. We define a Nonunified score that quantifies such self-contradiction. Our empirical results reveal that the self-contradiction mainly arises from weak generation that fails to align with prompts, rather than from misunderstanding. This capability asymmetry indicates the potential of leveraging self-contradiction for self-improvement, where the stronger understanding guides the weaker generation to mitigate the generation-understanding gap. Applying standard post-training methods (e.g., SFT, DPO) with such internal supervision successfully improves both generation and unification. We discover a co-improvement effect on both generation and understanding when fine-tuning only the generation branch, a phenomenon known in pre-training but underexplored in post-training. Our analysis shows the improvements stem from better detection of false positives that were previously incorrectly identified as prompt-aligned. Theoretically, we show that the aligned training dynamics between generation and understanding allow reduced prompt-misaligned generations to also improve mismatch detection in the understanding branch. Additionally, the framework reveals a potential risk of co-degradation under poor supervision, an overlooked phenomenon that we validate empirically. Notably, we find that intrinsic metrics like the Nonunified score cannot distinguish co-degradation from co-improvement, which highlights the necessity of data quality checks. Finally, based on our findings, we propose a curriculum-based strategy that gradually introduces harder samples as the model improves, leading to better unification and improved MLLM generation and understanding.
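The curriculum strategy at the end of the abstract can be sketched as an easy-to-hard schedule over the training pool. The names (`alignment_score`, `make_stages`), the cumulative-pool design, and the stage count are illustrative assumptions; in the paper's setting, difficulty would come from the understanding branch's own alignment judgement on each (prompt, generation) sample.

```python
# Hypothetical sketch of an easy-to-hard curriculum: high-alignment (easy)
# samples are trained on first, harder ones are added in later stages.

def alignment_score(sample: dict) -> float:
    # Stand-in for the understanding branch's alignment judgement in [0, 1];
    # here each toy sample simply carries a precomputed score.
    return sample["align"]

def make_stages(samples, n_stages=3):
    # Order samples easiest-first, then build cumulative training pools so
    # the hardest (most self-contradictory) samples enter only at the end.
    ordered = sorted(samples, key=alignment_score, reverse=True)
    stages = []
    for k in range(1, n_stages + 1):
        cutoff = round(k * len(ordered) / n_stages)
        stages.append(ordered[:cutoff])  # stage k extends stage k-1
    return stages

data = [{"id": i, "align": a} for i, a in enumerate([0.9, 0.2, 0.6, 0.4, 0.8])]
stages = make_stages(data)  # stage 0 holds only the easiest samples
```

Gating hard samples this way matches the abstract's motivation: introducing them only once the generation branch has improved reduces the risk that noisy internal supervision on hard cases drives co-degradation.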