Self-Contradiction as Self-Improvement: Mitigating the Generation-Understanding Gap in MLLMs

šŸ“… 2025-07-22
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ“„ PDF
šŸ¤– AI Summary
Multimodal large language models (MLLMs) exhibit a ā€œgeneration–understanding self-contradictionā€: generated images often mismatch the model's own interpretation of the input prompt, and this stems primarily from weak generative capability rather than deficient comprehension. This paper proposes a self-contradiction-driven self-improvement framework that leverages the model's stronger understanding as an internal supervisory signal to optimize the generation branch. Methodologically, the authors define a Nonunified score to quantify self-contradiction and adopt curriculum learning to progressively fine-tune the generation branch via supervised fine-tuning (SFT) and direct preference optimization (DPO) during post-training. Key findings: (i) optimizing only the generation branch can synergistically improve understanding, but poor supervision risks co-degradation; (ii) intrinsic consistency metrics such as the Nonunified score cannot distinguish co-improvement from co-degradation, which makes data-quality checks necessary. Experiments demonstrate significant improvements in generation–understanding alignment, better detection of false positives previously judged prompt-aligned, and bidirectional gains in both understanding and generation quality.

šŸ“ Abstract
Despite efforts to unify multimodal generation and understanding tasks in a single model, we show these MLLMs exhibit self-contradiction where generation produces images deemed misaligned with input prompts based on the model's own understanding. We define a Nonunified score that quantifies such self-contradiction. Our empirical results reveal that the self-contradiction mainly arises from weak generation that fails to align with prompts, rather than misunderstanding. This capability asymmetry indicates the potential of leveraging self-contradiction for self-improvement, where the stronger model understanding guides the weaker generation to mitigate the generation-understanding gap. Applying standard post-training methods (e.g., SFT, DPO) with such internal supervision successfully improves both generation and unification. We discover a co-improvement effect on both generation and understanding when only fine-tuning the generation branch, a phenomenon known in pre-training but underexplored in post-training. Our analysis shows improvements stem from better detection of false positives that are previously incorrectly identified as prompt-aligned. Theoretically, we show the aligned training dynamics between generation and understanding allow reduced prompt-misaligned generations to also improve mismatch detection in the understanding branch. Additionally, the framework reveals a potential risk of co-degradation under poor supervision, an overlooked phenomenon that is empirically validated in our experiments. Notably, we find intrinsic metrics like the Nonunified score cannot distinguish co-degradation from co-improvement, which highlights the necessity of data quality checks. Finally, we propose a curriculum-based strategy based on our findings that gradually introduces harder samples as the model improves, leading to better unification and improved MLLM generation and understanding.
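The abstract describes the Nonunified score as the degree to which the model's own understanding branch rejects its generated images. A minimal sketch of that idea, assuming a hypothetical interface where `generate` produces an image for a prompt and `judge_aligned` is the understanding branch's binary alignment verdict (the paper's exact formulation may differ):

```python
from typing import Callable, Sequence


def nonunified_score(
    prompts: Sequence[str],
    generate: Callable[[str], str],
    judge_aligned: Callable[[str, str], bool],
) -> float:
    """Fraction of prompts whose generated image the model's own
    understanding branch judges as misaligned (a self-contradiction).
    Hypothetical interface for illustration only."""
    if not prompts:
        return 0.0
    misaligned = sum(not judge_aligned(p, generate(p)) for p in prompts)
    return misaligned / len(prompts)


# Toy stand-ins: generation "fails" on dog prompts, and the judge
# checks whether the output matches what the prompt asked for.
def toy_generate(prompt: str) -> str:
    return "" if "dog" in prompt else prompt.upper()


def toy_judge(prompt: str, image: str) -> bool:
    return image == prompt.upper()


score = nonunified_score(["a cat", "a dog"], toy_generate, toy_judge)
print(score)  # 0.5: the model contradicts itself on half the prompts
```

A score of 0 would mean the understanding branch accepts every generation; higher values indicate a larger generation–understanding gap.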
Problem

Research questions and friction points this paper is trying to address.

MLLMs exhibit self-contradiction between generation and understanding
Weak generation fails to align with prompts, causing capability asymmetry
Internal supervision improves generation and understanding via co-improvement effect
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantifies self-contradiction via Nonunified score
Uses internal supervision for model self-improvement
Proposes curriculum-based strategy for better unification
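The curriculum-based strategy above gradually introduces harder samples as the model improves. A minimal sketch of one way to stage training data, assuming each sample carries a difficulty value (the paper derives difficulty from the model's own understanding signal; the scoring here is a placeholder):

```python
from typing import List, Tuple


def curriculum_batches(
    samples: List[Tuple[str, float]],  # (prompt, difficulty in [0, 1])
    stages: int = 3,
) -> List[List[str]]:
    """Split samples into stages of increasing difficulty so each
    post-training round (e.g. SFT, then DPO) sees progressively
    harder prompts. Illustrative sketch, not the paper's algorithm."""
    ordered = sorted(samples, key=lambda s: s[1])
    size = -(-len(ordered) // stages)  # ceiling division
    return [
        [prompt for prompt, _ in ordered[i * size:(i + 1) * size]]
        for i in range(stages)
    ]


data = [("a red cube", 0.1), ("two cats on a mat", 0.5), ("a clock showing 3:15", 0.9)]
print(curriculum_batches(data))  # [['a red cube'], ['two cats on a mat'], ['a clock showing 3:15']]
```

Training would then iterate over the stages in order, so the generation branch is fine-tuned on easy, high-confidence supervision before harder samples are introduced.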