🤖 AI Summary
Problem: Evaluating the chart-understanding capabilities of Large Vision-Language Models (LVLMs) is costly and time-consuming, which limits real-world deployment.
Method: We propose a lightweight, trustworthy automatic evaluation paradigm. First, we establish a standardized chart evaluation protocol and a fine-grained bias analysis framework (covering format adherence, positional consistency, length bias, and instruction-following) to systematically assess 13 open-source LVLMs (<10B parameters) as “automatic judges” on factual correctness, informativeness, and relevance. Evaluation employs both pairwise and pointwise scoring, calibrated against multidimensional human annotations.
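To make the pairwise protocol concrete, below is a minimal sketch of a position-swap consistency check of the kind used to probe positional bias. The `judge_fn` callable and the verdict labels are hypothetical stand-ins for an actual LVLM judge, not the paper's implementation.

```python
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

def pairwise_with_swap(
    judge_fn: Callable[[str, str, str], Verdict],  # (chart_context, first_answer, second_answer) -> verdict
    chart_context: str,
    answer_a: str,
    answer_b: str,
) -> dict:
    """Query the judge twice, swapping candidate order the second time.

    A positionally consistent judge prefers the same underlying answer
    regardless of the order in which the candidates are presented.
    """
    first = judge_fn(chart_context, answer_a, answer_b)   # answer_a shown first
    second = judge_fn(chart_context, answer_b, answer_a)  # answer_b shown first

    # Map the swapped-order verdict back to the original labels.
    remap = {"A": "B", "B": "A", "tie": "tie"}
    second_unswapped = remap[second]

    return {
        "verdict_original_order": first,
        "verdict_swapped_order": second_unswapped,
        "positionally_consistent": first == second_unswapped,
    }
```

Averaging `positionally_consistent` over a benchmark yields a positional-consistency rate; a judge that flips its preference when candidate order is swapped exhibits exactly the positional bias the framework is designed to surface.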
Contribution/Results: The top-performing open-source LVLM achieves ~80% agreement with GPT-4 judgments, approaching commercial-model performance, while the weakest judges fall below ~10%. Crucially, our analysis empirically uncovers pervasive structural biases, such as positional and length preferences, across models. This work introduces the first open-source benchmark and diagnostic toolkit for trustworthy LVLM chart evaluation, enabling scalable, transparent, and bias-aware assessment.
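For clarity on what the agreement figure measures, here is a minimal sketch of judge-vs-GPT-4 agreement over pairwise verdicts; the toy verdict lists are illustrative, not data from the paper.

```python
def agreement_rate(judge_verdicts: list[str], gpt4_verdicts: list[str]) -> float:
    """Fraction of items on which the open-source judge matches GPT-4's verdict."""
    assert len(judge_verdicts) == len(gpt4_verdicts)
    matches = sum(j == g for j, g in zip(judge_verdicts, gpt4_verdicts))
    return matches / len(judge_verdicts)

# Toy data: 8 of 10 matching verdicts corresponds to 80% agreement.
judge = ["A", "B", "A", "tie", "B", "A", "A", "B", "A", "B"]
gpt4  = ["A", "B", "A", "tie", "B", "A", "B", "B", "A", "A"]
print(f"agreement = {agreement_rate(judge, gpt4):.0%}")  # agreement = 80%
```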
📝 Abstract
Charts are ubiquitous because they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria such as factual correctness, informativeness, and relevance. Additionally, we analyze LVLM judges with respect to format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (<10B parameters) suitable for both research and commercial use, and follow a standardized evaluation protocol and rubric to measure each judge's accuracy. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below about 10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist.
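As a companion to the pairwise sketch above, here is a minimal sketch of the pointwise side of such a protocol: scoring a single response per criterion on a fixed rubric, with a format-adherence check on the judge's reply. The prompt template, criteria list, 1-5 scale, and `lvlm_judge` callable are illustrative assumptions, not the paper's exact rubric.

```python
from typing import Callable, Optional

CRITERIA = ("factual correctness", "informativeness", "relevance")

POINTWISE_TEMPLATE = (
    "You are grading an answer about a chart.\n"
    "Chart description: {chart}\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer's {criterion} on a 1-5 scale. Reply with a single digit."
)

def pointwise_scores(
    lvlm_judge: Callable[[str], str],  # hypothetical: prompt -> model completion
    chart: str,
    question: str,
    answer: str,
) -> dict[str, Optional[int]]:
    """Score one answer per criterion; None records a format-adherence failure."""
    scores: dict[str, Optional[int]] = {}
    for criterion in CRITERIA:
        prompt = POINTWISE_TEMPLATE.format(
            chart=chart, question=question, answer=answer, criterion=criterion
        )
        reply = lvlm_judge(prompt).strip()
        # Format adherence: the judge must emit a parseable, in-range score.
        scores[criterion] = int(reply) if reply in {"1", "2", "3", "4", "5"} else None
    return scores
```

Tracking how often a judge returns `None` here gives a simple format-adherence rate alongside the quality scores themselves.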