Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

📅 2026-04-27
🤖 AI Summary
Existing unified multimodal models are typically evaluated in isolation on either visual understanding or generation tasks, lacking assessment of semantic consistency across tasks. To address this gap, this work proposes XTC-Bench, a novel evaluation framework that introduces the Continuous Cross-Task Agreement (CCTA) metric to disentangle a model's internal consistency from its individual task performance at the atomic fact level. The framework leverages structured scene graphs to align comprehension queries with generation prompts and establishes the first reproducible, model-agnostic benchmark for cross-task consistency through fine-grained matching of objects, attributes, and relationships. Experiments across nine state-of-the-art models reveal that high task accuracy does not guarantee strong cross-task consistency, which is primarily influenced by the degree of coupling in cross-modal learning objectives.
📝 Abstract
Unified Multimodal Models (uMMs) aim to support both visual understanding and visual generation within a shared representation. However, existing evaluation protocols assess these two capabilities independently and do not examine whether they are semantically aligned. As a result, it remains unclear whether current uMMs learn coherent unified representations that remain consistent across tasks given a visual concept. We introduce XTC-Bench, a scene-graph-grounded evaluation framework that measures cross-task visual semantic consistency. By deriving both generation prompts and understanding queries from a structured scene graph, our framework enables fact-level alignment analysis across objects, attributes, and relations. We propose Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic facts, isolating internal consistency from standalone task accuracy. Extensive experiments on eight open-source and one commercial unified models reveal that high generation or understanding performance does not imply strong cross-task alignment, and architectural analysis shows consistency is governed by how tightly learning objectives are coupled across modalities, not by architectural unification alone. XTC-Bench provides a reproducible and model-agnostic framework for diagnosing representation-level misalignment, offering a concrete direction for advancing unified multimodal modeling beyond isolated task performance.
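The pipeline described in the abstract can be illustrated with a minimal Python sketch. It expands a small scene graph into atomic facts (objects, attributes, relations), then derives one generation prompt and one understanding query per fact from the same facts. Everything in it is an assumption made for illustration: the `Fact` type, the dictionary layout of the scene graph, the prompt templates, and the helper names are hypothetical, and `toy_agreement` is a simple stand-in (one minus the absolute difference of per-fact scores), not the CCTA formula, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Fact:
    """One atomic fact (hypothetical type); frozen so it can be a dict key."""
    kind: str          # "object", "attribute", or "relation"
    subject: str
    value: str = ""    # attribute value or relation predicate
    target: str = ""   # second object of a relation


def facts_from_scene_graph(graph: dict) -> List[Fact]:
    """Flatten an assumed scene-graph dict into a list of atomic facts."""
    facts: List[Fact] = []
    for obj in graph["objects"]:
        facts.append(Fact("object", obj["name"]))
        for attr in obj.get("attributes", []):
            facts.append(Fact("attribute", obj["name"], attr))
    for subj, pred, tgt in graph.get("relations", []):
        facts.append(Fact("relation", subj, pred, tgt))
    return facts


def describe(fact: Fact) -> str:
    """Render a fact as a short clause (illustrative template only)."""
    if fact.kind == "object":
        return f"there is a {fact.subject}"
    if fact.kind == "attribute":
        return f"the {fact.subject} is {fact.value}"
    return f"the {fact.subject} is {fact.value} the {fact.target}"


def generation_prompt(facts: List[Fact]) -> str:
    """Build one text-to-image prompt covering every fact."""
    return "An image where " + ", ".join(describe(f) for f in facts) + "."


def understanding_query(fact: Fact) -> str:
    """Build a yes/no question probing a single fact."""
    return f"Is it true that {describe(fact)}? Answer yes or no."


def toy_agreement(gen: Dict[Fact, float], und: Dict[Fact, float]) -> float:
    """Mean of 1 - |g - u| over facts scored in both tasks.

    Placeholder agreement score, NOT the paper's CCTA definition.
    Scores are assumed to lie in [0, 1].
    """
    shared = [f for f in gen if f in und]
    if not shared:
        raise ValueError("no facts were scored in both tasks")
    return sum(1.0 - abs(gen[f] - und[f]) for f in shared) / len(shared)
```

Under this stand-in, a fact the model gets wrong in both generation and understanding still counts as consistent, while a fact it handles correctly in only one task does not. That is the sense in which agreement over matched facts can be separated from standalone task accuracy.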
Problem

Research questions and friction points this paper is trying to address.

unified multimodal models
cross-task consistency
visual semantic alignment
evaluation benchmark
representation coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-task consistency
unified multimodal models
scene graph
semantic alignment
CCTA
Weixing Wang
Hasso Plattner Institute / University of Potsdam, Potsdam, Germany
Liudvikas Zekas
Hasso Plattner Institute / University of Potsdam, Potsdam, Germany
Anton Hackl
Hasso Plattner Institute / University of Potsdam, Potsdam, Germany
Constantin Alexander Auga
Hasso Plattner Institute / University of Potsdam, Potsdam, Germany
Parisa Shahabinejad
Hasso Plattner Institute / University of Potsdam, Potsdam, Germany
Jona Otholt
PhD student, Hasso Plattner Institute
Antonio Rueda-Toicen
Hasso Plattner Institute / University of Potsdam, Potsdam, Germany
Gerard de Melo
Professor at Hasso Plattner Institute / University of Potsdam
Artificial Intelligence, Natural Language Processing, Web Mining