Quantifying the Gap between Understanding and Generation within Unified Multimodal Models

📅 2026-02-02
🤖 AI Summary
It remains unclear whether current unified multimodal models achieve deep cognitive alignment between their understanding and generation capabilities. To address this, this work proposes GapEval, the first bidirectional evaluation benchmark specifically designed for symmetric image-text question answering. By combining cross-modal consistency assessment with knowledge manipulation analysis, GapEval systematically measures cognitive coherence and reasoning ability across bidirectional tasks. Experimental results reveal that prevailing unified architectures consistently exhibit a performance gap between comprehension and generation, with knowledge insufficiently synchronized across modalities, indicating that these models achieve only superficial unification rather than genuine cognitive integration. The study thus introduces a new evaluation paradigm and analytical perspective for assessing cognitive alignment in multimodal models.

📝 Abstract
Recent advances in unified multimodal models (UMMs) have demonstrated remarkable progress in both understanding and generation tasks. However, whether these two capabilities are genuinely aligned and integrated within a single model remains unclear. To investigate this question, we introduce GapEval, a bidirectional benchmark designed to quantify the gap between understanding and generation capabilities and to measure the cognitive coherence of the two "unified" directions. Each question can be answered in both modalities (image and text), enabling a symmetric evaluation of a model's bidirectional inference capability and cross-modal consistency. Experiments reveal a persistent gap between the two directions across a wide range of UMMs with different architectures, suggesting that current models achieve only surface-level unification rather than deep cognitive convergence of the two. To probe the underlying mechanism, we conduct an empirical study from the perspective of knowledge manipulation. Our findings indicate that knowledge within UMMs often remains disjoint: capability emergence and knowledge are unsynchronized across modalities, which leaves room for further exploration.
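
The abstract does not give GapEval's scoring formulas, so the following is only a minimal sketch of how a symmetric evaluation of this kind could be summarized. It assumes each question yields one correctness judgment per direction. The `PairedResult` schema, the `summarize` function, and the definitions of "gap" (difference in accuracy) and "consistency" (share of questions with the same outcome in both directions) are illustrative assumptions, not the benchmark's actual metrics.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """Outcome for one question answered in both directions (hypothetical schema)."""
    question_id: str
    text_correct: bool   # understanding direction: text answer judged correct
    image_correct: bool  # generation direction: generated image judged correct


def summarize(results):
    """Return per-direction accuracy, their difference, and an agreement rate.

    These definitions are assumptions made for illustration; they are not
    taken from the GapEval paper.
    """
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    understanding_acc = sum(r.text_correct for r in results) / n
    generation_acc = sum(r.image_correct for r in results) / n
    # Fraction of questions where both directions succeed or both fail.
    consistency = sum(r.text_correct == r.image_correct for r in results) / n
    return {
        "understanding_acc": understanding_acc,
        "generation_acc": generation_acc,
        "gap": understanding_acc - generation_acc,
        "consistency": consistency,
    }


if __name__ == "__main__":
    demo = [
        PairedResult("q1", True, True),
        PairedResult("q2", True, False),
        PairedResult("q3", False, False),
        PairedResult("q4", True, False),
    ]
    print(summarize(demo))
```

Reporting the agreement rate alongside the accuracy difference matters because two directions can reach similar accuracy while succeeding on different questions, which is the kind of disjoint knowledge the abstract describes.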

Problem

Research questions and friction points this paper is trying to address.

unified multimodal models
understanding-generation gap
cognitive coherence
cross-modal consistency
knowledge alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

unified multimodal models
understanding-generation gap
GapEval
cross-modal consistency
knowledge manipulation