UEval: A Benchmark for Unified Multimodal Generation

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation benchmarks struggle to measure the fine-grained quality of unified multimodal models on open-ended vision-language generation tasks. To address this gap, this work proposes UEval, a joint vision-language question-answering benchmark of 1,000 expert-designed instances spanning eight real-world task categories and diverse reasoning types, together with a rubric-based automated evaluation framework. By combining rubrics drafted by multimodal large language models with human refinement, the framework establishes a fine-grained scoring system grounded in 10,417 validated criteria, enabling scalable and precise assessment of multimodal generation. Experiments show that even the strongest current model, GPT-5-Thinking, scores only 66.4 out of 100, while the best open-source model scores 49.1. Explicit reasoning significantly improves generation quality, and transferring reasoning traces effectively narrows the performance gap between models.
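
To make the rubric-based framework concrete, below is a minimal sketch of how MLLM-drafted, human-validated criteria could drive automatic scoring. Everything here is illustrative: `draft_rubric`, `score_response`, the criterion weights, and the `mllm`/`judge` callables are assumptions for exposition, not UEval's released implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Criterion:
    """One rubric criterion: drafted by an MLLM, then human-validated."""
    description: str      # e.g. "Step 2 image shows the dough after kneading"
    weight: float = 1.0   # assumed: criteria may be weighted when aggregating
    validated: bool = False

@dataclass
class RubricItem:
    question: str
    criteria: list[Criterion] = field(default_factory=list)

def draft_rubric(mllm: Callable[[str], list[str]], question: str,
                 reference_answer: str) -> RubricItem:
    """Ask an MLLM to propose evaluation criteria from the reference answer.

    `mllm` is any callable mapping a prompt to a list of criterion strings;
    the real system would call a multimodal model with reference images too.
    """
    prompt = (f"Question: {question}\nReference answer: {reference_answer}\n"
              "List fine-grained criteria a correct answer must satisfy.")
    return RubricItem(question, [Criterion(c) for c in mllm(prompt)])

def score_response(judge: Callable[[str, str], bool],
                   rubric: RubricItem, response: str) -> float:
    """Score a model response on a 0-100 scale against validated criteria."""
    criteria = [c for c in rubric.criteria if c.validated]
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if judge(c.description, response))
    return 100.0 * earned / total if total else 0.0
```

In this sketch, any callable that maps a prompt to criterion strings, or a criterion-response pair to a verdict, can stand in for the MLLM, so the pipeline can be exercised without model access.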

📝 Abstract
We introduce UEval, a benchmark for evaluating unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. The questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Unlike previous work, which relies on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, UEval uses a rubric-based scoring system. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric consisting of multiple evaluation criteria; human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and that transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
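
The closing observation about reasoning traces suggests a simple protocol, and the sketch below shows one plausible form of it, assuming the trace is elicited as text from a reasoning model and prepended to the non-reasoning model's prompt. The function names and prompt wording are hypothetical, not the paper's exact procedure.

```python
from typing import Callable

def transfer_reasoning(reasoner: Callable[[str], str],
                       generator: Callable[[str], str],
                       question: str) -> str:
    """Answer `question` with `generator`, guided by `reasoner`'s trace.

    Hypothetical protocol: the reasoning model only plans (in text), and the
    possibly non-reasoning unified model produces the final interleaved
    image+text answer conditioned on that plan.
    """
    trace = reasoner(
        f"Think step by step about how to answer, but do not answer yet:\n{question}"
    )
    return generator(
        f"{question}\n\nUse this plan when composing your answer:\n{trace}"
    )
```

The design choice worth noting is that the trace is passed purely as prompt text, so any pair of models with a text interface can be combined without modifying either one.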
Problem

Research questions and friction points this paper is trying to address.

multimodal generation
unified models
evaluation benchmark
open-ended generation
image-text generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified multimodal generation
rubric-based evaluation
multimodal benchmark
reasoning transfer
expert-curated evaluation