Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?

📅 2026-02-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that existing unified multimodal large language models struggle to maintain semantic consistency between textual and visual outputs—failing to express equivalent reasoning outcomes across modalities. To this end, the paper introduces, for the first time, a “semantic equivalence” evaluation perspective and constructs VGUBench, a diagnostic benchmark that decouples reasoning logic from generation fidelity. The framework systematically assesses model performance across three task categories: textual understanding, visual answering, and visual rendering. Experimental results reveal that while models perform well in textual reasoning and basic image rendering, their visual answering capability degrades significantly, with no strong correlation to rendering quality. This finding exposes a fundamental flaw in current cross-modal semantic alignment mechanisms.

📝 Abstract
Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture. However, existing evaluations typically assess these capabilities separately, overlooking semantic equivalence, i.e., the ability to manifest consistent reasoning results regardless of the output modality. In this work, we investigate whether current U-MLLMs satisfy this premise. We observe that while models demonstrate robust textual reasoning, they fail to maintain semantic equivalence when required to render the same results in the image modality. To rigorously diagnose this discrepancy, we introduce VGUBench, a framework to decouple reasoning logic from generation fidelity. VGUBench comprises three diagnostic tasks: (1) Textual Generative Understanding, establishing a baseline for reasoning accuracy in textual response; (2) Visual Generative Understanding, evaluating the ability to generate visual responses that represent the correct answer; and (3) a Visual Rendering control task, which assesses the ability to directly render explicit visual descriptions into images without complex reasoning. Our evaluation reveals a significant disparity: despite strong performance in textual understanding and visual rendering, U-MLLMs exhibit a marked performance collapse when required to generate visual answers to questions. Furthermore, we find a negligible correlation between visual answering performance and basic rendering quality. These results suggest that the failure stems not from insufficient generation fidelity, but from a breakdown in cross-modal semantic alignment. We provide diagnostic insights to address this challenge in future Unified Generation and Understanding Models.
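
The abstract describes three diagnostic task categories and a correlation analysis between visual answering and rendering quality. The sketch below is a minimal, hypothetical illustration of how such an evaluation loop could be scored; the model interface (generate_text, generate_image), the judge callable, and the Item fields are assumptions for illustration, not the paper's released code.

```python
# Hypothetical sketch of a VGUBench-style evaluation loop (not the authors' implementation).
# Assumes a `model` object exposing text and image generation, and a `judge` callable
# that decides whether an output expresses the reference answer.

from dataclasses import dataclass
from statistics import mean
from typing import Callable, List

@dataclass
class Item:
    question: str        # reasoning question (used for tasks 1 and 2)
    render_prompt: str   # explicit visual description (control task 3)
    reference: str       # ground-truth answer

def evaluate(model, judge: Callable[[object, str], bool], items: List[Item]) -> dict:
    """Score the three diagnostic task categories described in the abstract."""
    text_acc, vqa_acc, render_acc = [], [], []
    for it in items:
        # (1) Textual Generative Understanding: answer in the text modality.
        text_out = model.generate_text(it.question)
        text_acc.append(judge(text_out, it.reference))

        # (2) Visual Generative Understanding: generate an image that depicts the answer.
        answer_img = model.generate_image(it.question)
        vqa_acc.append(judge(answer_img, it.reference))

        # (3) Visual Rendering control: draw an explicit description, no reasoning required.
        control_img = model.generate_image(it.render_prompt)
        render_acc.append(judge(control_img, it.reference))

    return {
        "textual_understanding": mean(text_acc),
        "visual_answering": mean(vqa_acc),
        "visual_rendering": mean(render_acc),
    }

def pearson(xs: List[float], ys: List[float]) -> float:
    """Correlation between per-model visual answering and rendering scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / var if var else 0.0
```

Under this reading, the paper's headline finding corresponds to high textual_understanding and visual_rendering scores, a much lower visual_answering score, and a near-zero pearson value between visual answering and rendering across models.
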
Problem

Research questions and friction points this paper is trying to address.

semantic equivalence
multimodal generation
unified models
cross-modal alignment
visual reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic equivalence
unified multimodal models
cross-modal alignment
VGUBench
visual generative understanding
🔎 Similar Papers
No similar papers found.
Authors
Hongbo Jiang
Hunan University
Mobile Computing, Wireless Networking, Privacy Preserving
Jie Li
China University of Mining and Technology
Emotion Recognition in Conversation
Yunhang Shen
Tencent Youtu Lab, Shanghai, China
Pingyang Dai
Xiamen University, Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen, China
Xing Sun
Tencent Youtu Lab
LLM, MLLM, Agent
Haoyu Cao
Tencent Youtu Lab, Shanghai, China
Liujuan Cao
Xiamen University, Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen, China