MLLM-as-a-Judge Exhibits Model Preference Bias

📅 2026-04-13

🤖 AI Summary

This work addresses model-specific preference bias in multimodal large language models (MLLMs) used as evaluators, which can distort model comparisons. The authors propose Philautia-Eval, a framework that quantifies this bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, they find that representative MLLMs tend to exhibit self-preference and that models within particular families show mutual preference, potentially driven by reused connectors and overlapping instruction-tuning resources. They also introduce Pomms, a simple ensemble of MLLMs, which mitigates the bias while maintaining evaluation performance.

📝 Abstract

Automatic evaluation using multimodal large language models (MLLMs), commonly referred to as MLLM-as-a-Judge, has been widely used to measure model performance. If such MLLM-as-a-Judge methods were biased, they could distort model comparisons and benchmark-driven scientific progress. However, it remains unclear to what extent MLLM-as-a-Judge methods favor or disfavor text generated by specific MLLMs. In this study, we propose Philautia-Eval to investigate such model-specific preference bias. Philautia-Eval quantifies the degree of the bias by disentangling preference tendencies from differences in generation quality. Using 1.29M caption-score pairs collected from 12 MLLMs, we found that representative MLLMs tend to exhibit self-preference bias. Moreover, experimental results indicate mutual preference bias within particular model families, which is potentially driven by reused connectors and overlapping instruction-tuning resources. Finally, we introduce a simple ensemble of MLLMs, Pomms. Our results demonstrated that Pomms effectively mitigated the model-specific preference bias while maintaining performance.
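
To make the idea of separating preference from generation quality concrete, here is a minimal sketch. It is not the Philautia-Eval procedure, whose details are not given on this page. It assumes a simple additive model in which each mean score is a generator-quality term plus a judge-leniency term plus a residual, and it reads the residual as preference. The function names, the toy numbers, and the use of a plain average as a stand-in for an ensemble such as Pomms are all assumptions made for illustration.

```python
from statistics import mean


def preference_residuals(scores):
    """Two-way additive decomposition (an illustrative assumption, not the paper's method).

    scores[j][g] is the mean score that judge j gave to captions from generator g.
    The residual left after removing the grand mean, the judge effect (leniency)
    and the generator effect (quality) is read as judge j's preference for g.
    """
    n_judges, n_gens = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    judge_effect = [mean(row) - grand for row in scores]
    gen_effect = [
        mean(scores[j][g] for j in range(n_judges)) - grand for g in range(n_gens)
    ]
    return [
        [scores[j][g] - grand - judge_effect[j] - gen_effect[g] for g in range(n_gens)]
        for j in range(n_judges)
    ]


def ensemble_scores(scores):
    """Plain average over judges for each generator (hypothetical stand-in for an ensemble)."""
    n_judges = len(scores)
    return [mean(scores[j][g] for j in range(n_judges)) for g in range(len(scores[0]))]


# Toy data: three models act as both judge and generator (made-up numbers).
quality = [3.0, 3.5, 4.0]     # hypothetical true caption quality per generator
leniency = [0.0, 0.2, -0.2]   # hypothetical per-judge scoring offset
self_bonus = 0.3              # hypothetical extra score a judge gives its own captions
toy = [
    [quality[g] + leniency[j] + (self_bonus if j == g else 0.0) for g in range(3)]
    for j in range(3)
]

residuals = preference_residuals(toy)
for j, row in enumerate(residuals):
    print(f"judge {j}:", [round(v, 3) for v in row])
print("ensemble:", [round(v, 3) for v in ensemble_scores(toy)])
```

In this toy setup the diagonal residuals come out positive and the off-diagonal ones negative, which is the pattern a self-preferring judge would produce. Averaging over judges does not remove the bonus but spreads it evenly across generators, so no single generator is favored and the quality ranking is preserved.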

Problem

Research questions and friction points this paper is trying to address.

MLLM-as-a-Judge

model preference bias

automatic evaluation

self-preference bias

mutual preference bias

Innovation

Methods, ideas, or system contributions that make the work stand out.

model preference bias

MLLM-as-a-Judge

Philautia-Eval

self-preference bias

Pomms
