AI Summary
Multimodal large language models (MLLMs) suffer from a "generation-comprehension self-contradiction": generated images often mismatch the model's own interpretation of the input prompt, stemming primarily from weak generative capability rather than deficient comprehension. This paper proposes a self-contradiction-driven self-improvement framework that leverages the model's stronger comprehension ability as an internal supervisory signal to guide optimization of the generation branch. Methodologically, we design a Nonunified score to quantify self-contradiction and integrate curriculum learning to progressively fine-tune the generation branch via supervised fine-tuning (SFT) and direct preference optimization (DPO) during post-training. Key findings include: (i) optimizing the generation branch can synergistically enhance comprehension, but poor supervision risks co-degradation; (ii) intrinsic consistency metrics such as the Nonunified score cannot distinguish co-improvement from co-degradation. Experiments demonstrate significant improvements in generation-comprehension alignment, detection of false positives previously misidentified as prompt-aligned, and gains in both unification and generation quality.
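The self-contradiction measurement and the internal-supervision idea can be sketched in a few lines. This is a toy illustration only: `generate`, `judge_alignment`, the token-overlap heuristic, the threshold, and the exact form of the score are assumptions for demonstration, not the paper's API or definition; in the real framework both roles are played by the MLLM's own generation and understanding branches.

```python
# Toy sketch of self-contradiction scoring and DPO pair construction.
# All function names and heuristics here are illustrative stand-ins.

def judge_alignment(prompt: str, image: str) -> float:
    # Stand-in for the understanding branch: alignment score in [0, 1]
    # for a (prompt, image) pair, here approximated by token overlap.
    p, i = set(prompt.split()), set(image.split())
    return len(p & i) / max(len(p), 1)

def generate(prompt: str) -> str:
    # Stand-in for the generation branch; a real MLLM returns an image.
    return prompt.replace("red", "blue")  # simulate a misaligned generation

def nonunified_score(prompts, threshold=0.9):
    # Fraction of generations the model's own understanding flags as
    # prompt-misaligned: higher means more self-contradiction.
    flagged = [p for p in prompts if judge_alignment(p, generate(p)) < threshold]
    return len(flagged) / len(prompts), flagged

def build_dpo_pairs(prompt, candidates, margin=0.2):
    # Internal supervision: rank candidate generations with the understanding
    # branch and keep (chosen, rejected) pairs separated by a score margin.
    scored = sorted(((judge_alignment(prompt, c), c) for c in candidates), reverse=True)
    best_score, best = scored[0]
    return [(best, c) for s, c in scored[1:] if best_score - s >= margin]

prompts = ["a red car on a road", "a dog in a park"]
score, hard_cases = nonunified_score(prompts)          # score = 0.5 here
pairs = build_dpo_pairs("a red car", ["a red car", "a blue car"])
```

The flagged `hard_cases` are exactly the samples where the stronger understanding branch disagrees with the weaker generation branch, which is what makes them usable as training signal.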
Abstract
Despite efforts to unify multimodal generation and understanding in a single model, we show that these MLLMs exhibit self-contradiction: generation produces images that the model's own understanding deems misaligned with the input prompts. We define a Nonunified score that quantifies such self-contradiction. Our empirical results reveal that the self-contradiction mainly arises from weak generation that fails to align with prompts, rather than from misunderstanding. This capability asymmetry indicates the potential of leveraging self-contradiction for self-improvement, where the stronger understanding guides the weaker generation to mitigate the generation-understanding gap. Applying standard post-training methods (e.g., SFT, DPO) with such internal supervision successfully improves both generation and unification. We discover a co-improvement effect on both generation and understanding when fine-tuning only the generation branch, a phenomenon known in pre-training but underexplored in post-training. Our analysis shows the improvements stem from better detection of false positives that were previously incorrectly identified as prompt-aligned. Theoretically, we show that the aligned training dynamics between generation and understanding allow reduced prompt-misaligned generations to also improve mismatch detection in the understanding branch. Additionally, the framework reveals a potential risk of co-degradation under poor supervision, an overlooked phenomenon that we validate empirically. Notably, we find that intrinsic metrics like the Nonunified score cannot distinguish co-degradation from co-improvement, which highlights the necessity of data quality checks. Finally, based on our findings, we propose a curriculum-based strategy that gradually introduces harder samples as the model improves, leading to better unification and improved MLLM generation and understanding.
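The curriculum strategy at the end of the abstract can be sketched as an easy-to-hard schedule over the training pool. The names (`alignment_score`, `make_stages`), the cumulative-pool design, and the stage count are illustrative assumptions; in the paper's setting, difficulty would come from the understanding branch's own alignment judgement on each (prompt, generation) sample.

```python
# Hypothetical sketch of an easy-to-hard curriculum: high-alignment (easy)
# samples are trained on first, harder ones are added in later stages.

def alignment_score(sample: dict) -> float:
    # Stand-in for the understanding branch's alignment judgement in [0, 1];
    # here each toy sample simply carries a precomputed score.
    return sample["align"]

def make_stages(samples, n_stages=3):
    # Order samples easiest-first, then build cumulative training pools so
    # the hardest (most self-contradictory) samples enter only at the end.
    ordered = sorted(samples, key=alignment_score, reverse=True)
    stages = []
    for k in range(1, n_stages + 1):
        cutoff = round(k * len(ordered) / n_stages)
        stages.append(ordered[:cutoff])  # stage k extends stage k-1
    return stages

data = [{"id": i, "align": a} for i, a in enumerate([0.9, 0.2, 0.6, 0.4, 0.8])]
stages = make_stages(data)  # stage 0 holds only the easiest samples
```

Gating hard samples this way matches the abstract's motivation: introducing them only once the generation branch has improved reduces the risk that noisy internal supervision on hard cases drives co-degradation.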