Self-Ensembling Vision-Language Models for Chart Data Extraction

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

This work addresses the extraction of structured tabular data from chart images, a task on which existing methods still struggle when charts contain many datapoints or vary substantially in style. The authors propose a self-ensembling approach built on a single vision-language model (VLM): the model is sampled repeatedly on the same chart image to produce multiple candidate tables. These candidates are aligned at the cell level and combined by taking per-cell medians of the numerical values, yielding a consensus table. Convergence detection stops sampling once the consensus stabilizes, and dispersion across samples provides an uncertainty estimate. Because existing benchmarks contain relatively simple plots, the authors also introduce WB-ChartExtract, a benchmark built from World Bank data whose charts contain on average 7 times more datapoints than those in ChartQA. The approach improves extraction accuracy over single-pass VLM outputs on both ChartQA and WB-ChartExtract, with up to 23% relative improvement on the latter.
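The cell-level aggregation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes each sampled table has already been parsed into a dictionary keyed by (row label, column label), uses that shared key as the alignment step, and uses the median absolute deviation as one possible dispersion-based uncertainty score. The function name `consensus_table` and the example values are hypothetical.

```python
from statistics import median


def consensus_table(samples):
    """Combine sampled tables cell by cell.

    Assumption: each table is a dict mapping (row_label, column_label)
    to a numeric value, so matching keys act as the cell alignment.
    """
    # Collect every value proposed for each cell across all samples.
    cells = {}
    for table in samples:
        for key, value in table.items():
            cells.setdefault(key, []).append(value)

    consensus, dispersion = {}, {}
    for key, values in cells.items():
        med = median(values)
        consensus[key] = med
        # Median absolute deviation: an illustrative dispersion measure,
        # not necessarily the one used in the paper.
        dispersion[key] = median(abs(v - med) for v in values)
    return consensus, dispersion


# Three hypothetical samples for the same chart; the second misreads one cell.
samples = [
    {("2020", "GDP"): 10.0, ("2021", "GDP"): 12.0},
    {("2020", "GDP"): 10.2, ("2021", "GDP"): 21.0},
    {("2020", "GDP"): 9.9, ("2021", "GDP"): 12.1},
]
consensus, dispersion = consensus_table(samples)
```

Because the median ignores a single outlying sample, the misread value of 21.0 does not shift the consensus for that cell.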

📝 Abstract

Charts effectively convey quantitative information, but the underlying data are often locked in image form, hindering reuse and analysis. Manually digitizing charts is time-consuming and error-prone, motivating automatic chart-to-table extraction. Recent approaches use specialized vision-language models (VLMs), yet performance still lags on charts with many datapoints or substantial stylistic variation. We propose a VLM self-ensembling method that repeatedly samples multiple tabular outputs from the same VLM for a fixed chart image and aggregates them at the level of individual table cells. We align candidate tables and take per-cell medians over numerical values to produce a more accurate consensus table. Our method also includes convergence detection to stop sampling once the aggregated table stabilizes, and uncertainty estimation based on dispersion across samples to help users assess extraction reliability. Because existing chart extraction benchmarks contain relatively simple plots with limited room for improvement, we introduce WB-ChartExtract, a new benchmark built from World Bank data with more complex and stylistically diverse charts; on average, its charts contain 7 times more datapoints than those in the ChartQA benchmark. Across both ChartQA and WB-ChartExtract, our approach improves extraction accuracy over single-pass VLM outputs, yielding up to 23% relative improvement on WB-ChartExtract after ensembling. More broadly, our method helps unlock tabular data previously siloed in chart images, enabling downstream analysis and reuse.
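The convergence detection mentioned in the abstract can be sketched as a sampling loop that stops once the aggregated table stops changing. The sketch below is an assumption-laden illustration, not the paper's procedure: `sample_table` is a hypothetical callable standing in for one VLM call that returns a parsed table, and the stopping rule (`rel_tol`, `patience`, `min_samples`, `max_samples`) is an invented heuristic chosen for clarity.

```python
import math
from statistics import median


def aggregate(samples):
    """Per-cell median over tables given as {cell_key: number} dicts."""
    keys = set().union(*samples)
    return {k: median(t[k] for t in samples if k in t) for k in keys}


def tables_close(a, b, rel_tol):
    """True if both tables have the same cells and near-equal values."""
    return a.keys() == b.keys() and all(
        math.isclose(a[k], b[k], rel_tol=rel_tol, abs_tol=1e-12) for k in a
    )


def ensemble_until_stable(sample_table, max_samples=20, min_samples=3,
                          patience=2, rel_tol=0.01):
    """Sample until the consensus is stable for `patience` rounds in a row.

    `sample_table` is a hypothetical zero-argument function wrapping the VLM.
    All thresholds are illustrative assumptions.
    """
    samples, previous, stable_rounds = [], None, 0
    for _ in range(max_samples):
        samples.append(sample_table())
        current = aggregate(samples)
        if (previous is not None and len(samples) >= min_samples
                and tables_close(previous, current, rel_tol)):
            stable_rounds += 1
            if stable_rounds >= patience:
                break
        else:
            stable_rounds = 0
        previous = current
    return current, len(samples)


# Deterministic stand-in for the VLM: replays fixed readings of one cell.
readings = iter([10.0, 14.0, 10.1, 10.0, 10.05, 10.0, 10.0, 10.0])
fake_vlm = lambda: {("2020", "GDP"): next(readings)}
table, n_used = ensemble_until_stable(fake_vlm)
```

Stopping early in this way trades a small risk of premature convergence for fewer model calls; a stricter `rel_tol` or larger `patience` shifts that balance toward more sampling.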

Problem

Research questions and friction points this paper is trying to address.

chart data extraction

vision-language models

tabular data

data digitization

chart-to-table conversion

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-ensembling

vision-language models

chart data extraction