Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?

📅 2026-02-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that existing unified multimodal large language models struggle to maintain semantic consistency between textual and visual outputs—failing to express equivalent reasoning outcomes across modalities. To this end, the paper introduces, for the first time, a “semantic equivalence” evaluation perspective and constructs VGUBench, a diagnostic benchmark that decouples reasoning logic from generation fidelity. The framework systematically assesses model performance across three task categories: textual understanding, visual answering, and visual rendering. Experimental results reveal that while models perform well in textual reasoning and basic image rendering, their visual answering capability degrades significantly, with no strong correlation to rendering quality. This finding exposes a fundamental flaw in current cross-modal semantic alignment mechanisms.

📝 Abstract
Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture. However, existing evaluations typically assess these capabilities separately, overlooking semantic equivalence, i.e., the ability to manifest consistent reasoning results regardless of the output modality. In this work, we investigate whether current U-MLLMs satisfy this premise. We observe that while models demonstrate robust textual reasoning, they fail to maintain semantic equivalence when required to render the same results in the image modality. To rigorously diagnose this discrepancy, we introduce VGUBench, a framework to decouple reasoning logic from generation fidelity. VGUBench comprises three diagnostic tasks: (1) Textual Generative Understanding, establishing a baseline for reasoning accuracy in textual response; (2) Visual Generative Understanding, evaluating the ability to generate visual responses that represent the correct answer; and (3) a Visual Rendering control task, which assesses the ability to directly render explicit visual descriptions into images without complex reasoning. Our evaluation reveals a significant disparity: despite strong performance in textual understanding and visual rendering, U-MLLMs exhibit a marked performance collapse when required to generate visual answers to questions. Furthermore, we find a negligible correlation between visual answering performance and basic rendering quality. These results suggest that the failure stems not from insufficient generation fidelity, but from a breakdown in cross-modal semantic alignment. We provide diagnostic insights to address this challenge in future Unified Generation and Understanding Models.
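
The abstract describes three diagnostic task categories and a correlation analysis between visual answering and rendering quality. The sketch below is a minimal, hypothetical illustration of how such an evaluation loop could be scored; the model interface (generate_text, generate_image), the judge callable, and the Item fields are assumptions for illustration, not the paper's released code.

```python
# Hypothetical sketch of a VGUBench-style evaluation loop (not the authors' implementation).
# Assumes a `model` object exposing text and image generation, and a `judge` callable
# that decides whether an output expresses the reference answer.

from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

@dataclass
class Item:
    question: str        # reasoning question (used for tasks 1 and 2)
    render_prompt: str   # explicit visual description (control task 3)
    reference: str       # ground-truth answer

def evaluate(model, judge: Callable[[object, str], bool], items: List[Item]) -> dict:
    """Score the three diagnostic task categories described in the abstract."""
    text_acc, vqa_acc, render_acc = [], [], []
    for it in items:
        # (1) Textual Generative Understanding: answer in the text modality.
        text_out = model.generate_text(it.question)
        text_acc.append(judge(text_out, it.reference))

        # (2) Visual Generative Understanding: generate an image that depicts the answer.
        answer_img = model.generate_image(it.question)
        vqa_acc.append(judge(answer_img, it.reference))

        # (3) Visual Rendering control: draw an explicit description, no reasoning required.
        control_img = model.generate_image(it.render_prompt)
        render_acc.append(judge(control_img, it.reference))

    return {
        "textual_understanding": mean(text_acc),
        "visual_answering": mean(vqa_acc),
        "visual_rendering": mean(render_acc),
    }

def pearson(xs: List[float], ys: List[float]) -> float:
    """Correlation between per-model visual answering and rendering scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / var if var else 0.0
```

Under this reading, the paper's headline finding corresponds to high textual_understanding and visual_rendering scores, a much lower visual_answering score, and a near-zero pearson value between visual answering and rendering across models.
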
Problem

Research questions and friction points this paper is trying to address.

semantic equivalence
multimodal generation
unified models
cross-modal alignment
visual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic equivalence
unified multimodal models
cross-modal alignment
VGUBench
visual generative understanding
🔎 Similar Papers
No similar papers found.
Authors
Hongbo Jiang
Hunan University
Mobile Computing, Wireless Networking, Privacy Preserving
Jie Li
China University of Mining and Technology
Emotion Recognition in Conversation
Yunhang Shen
Tencent Youtu Lab, Shanghai, China
Pingyang Dai
Xiamen University, Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen, China
Xing Sun
Tencent Youtu Lab
LLM, MLLM, Agent
Haoyu Cao
Tencent Youtu Lab, Shanghai, China
Liujuan Cao
Xiamen University, Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen, China