Lost in Translation: Do LVLM Judges Generalize Across Languages?

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of systematic evaluation of large vision-language model (LVLM) judges beyond English by introducing MM-JudgeBench, the first large-scale benchmark for multilingual, multimodal judge model evaluation. It covers 25 typologically diverse languages and more than 60,000 pairwise preference instances, and is accompanied by a multilingual training set, disjoint from the evaluation data, to support domain adaptation. The benchmark combines two subsets: a general vision-language preference subset that extends VL-RewardBench and a chart-centric visual-text reasoning subset derived from OpenCQA. Evaluating 22 LVLMs (15 open-source, 7 proprietary) reveals substantial performance variance across languages, and neither model size nor architecture reliably predicts multilingual robustness. These findings expose fundamental limitations in current reward modeling approaches.

📝 Abstract

Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.
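
To make the notion of cross-lingual performance variance concrete, the sketch below scores a judge on pairwise preference instances and summarizes how its accuracy spreads across languages. It is only an illustration under stated assumptions: the record layout, the function names, the language codes, and the toy data are hypothetical and do not come from the MM-JudgeBench release, and the paper may report different metrics.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_language_accuracy(records):
    """Return pairwise accuracy per language.

    `records` is an iterable of (language, judge_choice, preferred) tuples,
    where `judge_choice` and `preferred` are labels such as "A" or "B".
    This layout is an assumption made for illustration.
    """
    hits = defaultdict(list)
    for lang, choice, gold in records:
        hits[lang].append(1.0 if choice == gold else 0.0)
    return {lang: mean(scores) for lang, scores in hits.items()}


def cross_lingual_spread(acc_by_lang):
    """Summarize how much accuracy varies across languages."""
    values = list(acc_by_lang.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),              # population std over languages
        "gap": max(values) - min(values),   # best minus worst language
    }


# Toy data: placeholder language codes and judge decisions, not real results.
records = [
    ("en", "A", "A"), ("en", "B", "B"), ("en", "A", "A"), ("en", "B", "B"),
    ("sw", "A", "B"), ("sw", "B", "B"), ("sw", "A", "A"), ("sw", "B", "A"),
]

acc = per_language_accuracy(records)
spread = cross_lingual_spread(acc)
print(acc)
print(spread)
```

A judge whose "gap" is large relative to its mean accuracy would be one that looks reliable on average while behaving inconsistently across languages, which is the kind of behavior the abstract describes.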

Problem

Research questions and friction points this paper is trying to address.

multilingual generalization

vision-language models

reward models

cross-lingual evaluation

automated evaluators

Innovation

Methods, ideas, or system contributions that make the work stand out.

multilingual benchmark

vision-language models

reward modeling