Judge Anything: MLLM as a Judge Across Any Modality

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the lack of a unified, scalable evaluation paradigm for generative multimodal large language models (MLLMs) on open-ended multimodal understanding (MMU) and generation (MMG) tasks. The authors propose a cross-modal unified evaluation framework that employs MLLMs as automatic judges. Key contributions are: (1) a novel "judgment-as-benchmark" paradigm, instantiated by TaskAnything (a diverse task suite) and JudgeAnything, the first unified benchmark covering 15 any-to-any modality combinations; (2) a dual-dimension judgment protocol (pairwise comparison and fine-grained scoring) integrated with human-aligned rubrics and the OmniArena automation platform; and (3) evidence of significant cross-modal bias and hallucination when MLLMs judge generation tasks, with judgment accuracy of 66.55%/53.37% (pairwise) and 42.79%/30.05% (scoring) for MMU/MMG, respectively. All code, data, and frameworks are publicly released.

📝 Abstract
Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in Pair Comparison setting and 42.79% in Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in Pair Comparison setting and 30.05% in Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://urrealhero.github.io/judgeanythingweb/.
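To make the two judgment settings concrete, below is a minimal sketch of how the headline agreement numbers might be computed: Pair Comparison accuracy as the fraction of judge preferences that match the human-preferred response, and Score Evaluation accuracy as (near-)exact agreement on a rubric score. The dataclass fields, metric definitions, and tolerance parameter are illustrative assumptions, not the paper's released code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical per-sample records; field names are assumptions, not the
# paper's actual data schema.
@dataclass
class PairJudgment:
    model_choice: str   # "A" or "B": preference emitted by the MLLM judge
    human_choice: str   # "A" or "B": gold human preference

@dataclass
class ScoreJudgment:
    model_score: int    # rubric score from the MLLM judge, e.g. 1-5
    human_score: int    # gold human rubric score

def pair_comparison_accuracy(judgments: List[PairJudgment]) -> float:
    """Fraction of samples where the judge picks the human-preferred response."""
    hits = sum(j.model_choice == j.human_choice for j in judgments)
    return hits / len(judgments)

def score_evaluation_accuracy(judgments: List[ScoreJudgment], tol: int = 0) -> float:
    """Fraction of samples where the judge's rubric score lands within
    `tol` points of the human score (tol=0 means exact agreement)."""
    hits = sum(abs(j.model_score - j.human_score) <= tol for j in judgments)
    return hits / len(judgments)

if __name__ == "__main__":
    pairs = [PairJudgment("A", "A"), PairJudgment("B", "A"), PairJudgment("B", "B")]
    scores = [ScoreJudgment(4, 5), ScoreJudgment(3, 3), ScoreJudgment(2, 2)]
    print(f"Pair Comparison accuracy: {pair_comparison_accuracy(pairs):.2%}")
    print(f"Score Evaluation accuracy: {score_evaluation_accuracy(scores):.2%}")
```

Averaging such per-sample agreement over all MMU or MMG queries would yield aggregate figures of the kind reported above (e.g., 66.55% pairwise agreement on MMU).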
Problem

Research questions and friction points this paper is trying to address.

No unified, scalable protocol exists for evaluating generative models on open-ended tasks across diverse modalities
Whether MLLMs can reliably judge any-to-any modality tasks remains untested
Cross-modality biases and hallucination distort evaluations of generative models
Innovation

Methods, ideas, or system contributions that make the work stand out.

MLLMs as judges for any-to-any modality tasks
OmniArena platform for automated omni-model evaluation
Standardized benchmarks TaskAnything and JudgeAnything
👥 Authors
Shu Pu, Huazhong University of Science and Technology (3D Vision, Geometry and Graphics, 3D Representation)
Yaochen Wang, Huazhong University of Science and Technology
Dongping Chen, Huazhong University of Science and Technology
Yuhang Chen, Huazhong University of Science and Technology
Guohao Wang, Huazhong University of Science and Technology
Qi Qin, Huazhong University of Science and Technology
Zhongyi Zhang, Huazhong University of Science and Technology
Zhiyuan Zhang, Huazhong University of Science and Technology
Zetong Zhou, Huazhong University of Science and Technology
Shuang Gong, Huazhong University of Science and Technology
Yi Gui, Huazhong University of Science and Technology
Yao Wan, Huazhong University of Science and Technology (NLP, Programming Languages, Software Engineering, Large Language Models)
Philip S. Yu, Professor of Computer Science, University of Illinois at Chicago (Data Mining, Databases, Privacy)