🤖 AI Summary
Existing benchmarks do not rigorously test whether the understanding and generation capabilities of unified multimodal models genuinely co-evolve and reinforce each other.
Method: We introduce RealUnify, the first benchmark explicitly designed to assess bidirectional capability synergy, comprising 10 categories, 32 fine-grained subtasks, and 1,000 human-annotated instances. It establishes a dual-axis evaluation framework ("understanding→generation" and "generation→understanding") and pairs end-to-end assessment with a stepwise diagnostic protocol to pinpoint synergy bottlenecks. Tasks use reasoning-guided generation and mental simulation/reconstruction to systematically probe leading models.
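As a concrete illustration of the dual-axis structure, the sketch below shows how a single benchmark instance might be represented. The class, field, and enum names are illustrative assumptions chosen for exposition, not RealUnify's actual data schema.

```python
# Minimal sketch of a RealUnify-style benchmark instance.
# All names below are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SynergyAxis(Enum):
    U2G = "understanding_enhances_generation"  # reasoning guides image generation
    G2U = "generation_enhances_understanding"  # mental simulation aids reasoning


@dataclass
class RealUnifyInstance:
    instance_id: str
    axis: SynergyAxis
    category: str              # one of the 10 top-level categories
    subtask: str               # one of the 32 fine-grained subtasks
    prompt: str                # task instruction shown to the model
    image_path: Optional[str]  # input image for G2U tasks; may be None for U2G
    reference: str             # human-annotated answer or generation criterion
```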
Results: Our evaluation reveals that current unified architectures exhibit significant synergy deficits: architectural unification does not imply functional coordination. Understanding and generation remain largely decoupled, underscoring the urgent need for novel training paradigms and inductive biases that explicitly foster cross-task capability transfer and mutual reinforcement.
📝 Abstract
The integration of visual understanding and generation into unified multimodal models represents a significant stride toward general-purpose AI. However, a fundamental question remains unanswered by existing benchmarks: does this architectural unification actually enable synergistic interaction between the constituent capabilities? Existing evaluation paradigms, which primarily assess understanding and generation in isolation, are insufficient for determining whether a unified model can leverage its understanding to enhance its generation, or use generative simulation to facilitate deeper comprehension. To address this critical gap, we introduce RealUnify, a benchmark specifically designed to evaluate bidirectional capability synergy. RealUnify comprises 1,000 meticulously human-annotated instances spanning 10 categories and 32 subtasks. It is structured around two core axes: 1) Understanding Enhances Generation, which requires reasoning (e.g., commonsense, logic) to guide image generation, and 2) Generation Enhances Understanding, which necessitates mental simulation or reconstruction (e.g., of transformed or disordered visual inputs) to solve reasoning tasks. A key contribution is our dual-evaluation protocol, which combines direct end-to-end assessment with a diagnostic stepwise evaluation that decomposes tasks into distinct understanding and generation phases. This protocol allows us to discern precisely whether performance bottlenecks stem from deficiencies in core abilities or from a failure to integrate them. Through large-scale evaluations of 12 leading unified models and 6 specialized baselines, we find that current unified models still struggle to achieve effective synergy, indicating that architectural unification alone is insufficient. These results highlight the need for new training strategies and inductive biases to fully unlock the potential of unified modeling.
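To make the dual-evaluation protocol concrete, the sketch below contrasts direct end-to-end scoring with the stepwise decomposition into understanding and generation phases, reusing the instance representation sketched earlier. The `model.solve` and `judge.is_correct` interfaces, the prompt templates, and the diagnosis labels are hypothetical placeholders; the paper's actual prompting and scoring details may differ.

```python
# Hedged sketch of a dual-evaluation harness: end-to-end vs. stepwise.
# `model` and `judge` are hypothetical duck-typed interfaces, not RealUnify's API.


def evaluate_end_to_end(model, instance, judge) -> bool:
    """Single pass: the model must integrate understanding and generation itself."""
    output = model.solve(instance.prompt, image=instance.image_path)
    return judge.is_correct(output, instance.reference)


def evaluate_stepwise(model, instance, judge) -> tuple[bool, bool]:
    """Diagnostic pass: probe each core ability in isolation."""
    # Phase 1 (understanding): elicit the intermediate reasoning explicitly.
    reasoning = model.solve(
        f"Explain the reasoning required by: {instance.prompt}",
        image=instance.image_path,
    )
    # A real protocol would likely score each phase against phase-specific references.
    understanding_ok = judge.is_correct(reasoning, instance.reference)

    # Phase 2 (generation): condition the model on the elicited reasoning.
    output = model.solve(f"{instance.prompt}\nUse this reasoning: {reasoning}")
    generation_ok = judge.is_correct(output, instance.reference)
    return understanding_ok, generation_ok


def diagnose(model, instance, judge) -> str:
    """Contrast the two passes to locate the bottleneck."""
    e2e = evaluate_end_to_end(model, instance, judge)
    u_ok, g_ok = evaluate_stepwise(model, instance, judge)
    if e2e:
        return "synergy achieved"
    if u_ok and g_ok:
        return "integration failure (core abilities intact)"
    return "core-ability deficit"
```

The diagnostic value lies in the contrast: when both isolated phases succeed but the end-to-end pass fails, the bottleneck is the integration of capabilities rather than a missing core ability, which is exactly the distinction the stepwise protocol is designed to expose.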