🤖 AI Summary
Problem: Evaluating the chart-understanding capabilities of Large Vision-Language Models (LVLMs) is costly and time-consuming, which limits real-world deployment.
Method: We propose a lightweight, trustworthy automatic evaluation paradigm. First, we establish a standardized chart evaluation protocol and a fine-grained bias analysis framework (covering format adherence, positional consistency, length bias, and instruction-following) to systematically assess 13 open-source LVLMs (<10B parameters) as “automatic judges” on factual correctness, informativeness, and relevance. Evaluation employs both pairwise and pointwise scoring, calibrated against multidimensional human annotations.
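To make the pairwise protocol concrete, below is a minimal sketch of a position-swap consistency check of the kind used to probe positional bias. The `judge_fn` callable and the verdict labels are hypothetical stand-ins for an actual LVLM judge, not the paper's implementation.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

def pairwise_with_swap(
    judge_fn: Callable[[str, str, str], Verdict],  # (chart_context, first_answer, second_answer) -> verdict
    chart_context: str,
    answer_a: str,
    answer_b: str,
) -> dict:
    """Query the judge twice, swapping candidate order the second time.

    A positionally consistent judge prefers the same underlying answer
    regardless of the order in which the candidates are presented.
    """
    first = judge_fn(chart_context, answer_a, answer_b)   # answer_a shown first
    second = judge_fn(chart_context, answer_b, answer_a)  # answer_b shown first

    # Map the swapped-order verdict back to the original labels.
    remap = {"A": "B", "B": "A", "tie": "tie"}
    second_unswapped = remap[second]

    return {
        "verdict_original_order": first,
        "verdict_swapped_order": second_unswapped,
        "positionally_consistent": first == second_unswapped,
    }
```

Averaging `positionally_consistent` over a benchmark yields a positional-consistency rate; a judge that flips its preference when candidate order is swapped exhibits exactly the positional bias the framework is designed to surface.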
Contribution/Results: The top-performing open-source LVLM achieves ~80% agreement with GPT-4 judgments, approaching commercial-model performance, while the weakest judges fall below ~10%. Crucially, our analysis empirically uncovers pervasive structural biases, such as positional and length preferences, across models. This work introduces the first open-source benchmark and diagnostic toolkit for trustworthy LVLM chart evaluation, enabling scalable, transparent, and bias-aware assessment.
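For clarity on what the agreement figure measures, here is a minimal sketch of judge-vs-GPT-4 agreement over pairwise verdicts; the toy verdict lists are illustrative, not data from the paper.

```python
def agreement_rate(judge_verdicts: list[str], gpt4_verdicts: list[str]) -> float:
    """Fraction of items on which the open-source judge matches GPT-4's verdict."""
    assert len(judge_verdicts) == len(gpt4_verdicts)
    matches = sum(j == g for j, g in zip(judge_verdicts, gpt4_verdicts))
    return matches / len(judge_verdicts)

# Toy data: 8 of 10 matching verdicts corresponds to 80% agreement.
judge = ["A", "B", "A", "tie", "B", "A", "A", "B", "A", "B"]
gpt4  = ["A", "B", "A", "tie", "B", "A", "B", "B", "A", "A"]
print(f"agreement = {agreement_rate(judge, gpt4):.0%}")  # agreement = 80%
```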
📝 Abstract
Charts are ubiquitous because they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria such as factual correctness, informativeness, and relevance. Additionally, we analyze LVLM judges with respect to format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (<10B parameters) suitable for both research and commercial use, and follow a standardized evaluation protocol and rubric to measure each judge's accuracy. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below about 10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist.
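As a companion to the pairwise sketch above, here is a minimal sketch of the pointwise side of such a protocol: scoring a single response per criterion on a fixed rubric, with a format-adherence check on the judge's reply. The prompt template, criteria list, 1-5 scale, and `lvlm_judge` callable are illustrative assumptions, not the paper's exact rubric.

```python
from typing import Callable, Optional

CRITERIA = ("factual correctness", "informativeness", "relevance")

POINTWISE_TEMPLATE = (
    "You are grading an answer about a chart.\n"
    "Chart description: {chart}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer's {criterion} on a 1-5 scale. Reply with a single digit."
)

def pointwise_scores(
    lvlm_judge: Callable[[str], str],  # hypothetical: prompt -> model completion
    chart: str,
    question: str,
    answer: str,
) -> dict[str, Optional[int]]:
    """Score one answer per criterion; None records a format-adherence failure."""
    scores: dict[str, Optional[int]] = {}
    for criterion in CRITERIA:
        prompt = POINTWISE_TEMPLATE.format(
            chart=chart, question=question, answer=answer, criterion=criterion
        )
        reply = lvlm_judge(prompt).strip()
        # Format adherence: the judge must emit a parseable, in-range score.
        scores[criterion] = int(reply) if reply in {"1", "2", "3", "4", "5"} else None
    return scores
```

Tracking how often a judge returns `None` here gives a simple format-adherence rate alongside the quality scores themselves.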