🤖 AI Summary
Existing chart-to-data extraction benchmarks overlook the temporal structure of time series, and conventional key-value metrics treat small temporal shifts as catastrophic failures, so they fail to reflect true model performance. To address this, the work introduces EpiCurveBench, a benchmark of 1,000 real-world epidemic curve images, and proposes EpiCurveSimilarity (ECS), a dynamic-programming-based metric that tolerates local temporal shifts and gaps while penalizing them proportionally. ECS separates models more clearly than key-value metrics: it spreads the four general-purpose VLMs over a 25-point range, where RMS and SCRM compress them into a 5-point band. It also correlates 1.5 to 3.6 times more strongly than Dynamic Time Warping with four downstream epidemiological summary statistics (total counts, peak timing, peak magnitude, and growth rate). The strongest model evaluated reaches only 52.3% ECS, underscoring the difficulty of the task.
📝 Abstract
Chart-to-data extraction with vision-language models (VLMs) is increasingly evaluated on benchmarks that show diminishing headroom (frontier VLMs exceed 89% on ChartQA) and with metrics that treat extracted points as unordered key-value pairs, ignoring the temporal structure of time series and penalizing small alignment shifts as catastrophic failures. We address both gaps with EpiCurveBench, a benchmark of 1,000 real-world epidemic curve images curated from diverse public-health sources, and EpiCurveSimilarity (ECS), an evaluation metric that aligns predicted and ground-truth series via dynamic programming, tolerating local temporal shifts and gaps while penalizing them proportionally. Evaluating six methods (three frontier closed VLMs, one open VLM, and two specialized chart-extraction systems), we find the strongest model reaches only 52.3% ECS, and that ECS spreads the four general-purpose VLMs over a 25-point range where key-value metrics (RMS, SCRM) compress them into a 5-point band. We further validate ECS against four downstream epidemiological summary statistics, finding that higher ECS predicts smaller errors in total counts, peak timing, and peak magnitude, and higher growth-rate fidelity. Across all four, ECS correlates 1.5 to 3.6 times more strongly than Dynamic Time Warping, which lacks a gap penalty and therefore cannot distinguish a truncated prediction from a temporally faithful one. EpiCurveBench targets a high-impact public-health application, unlocking decades of outbreak data trapped in published figures, but the benchmark and metric apply directly to any structured time-series chart-extraction setting.
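The abstract does not give the ECS formula, so the following is only a minimal sketch of the general idea it describes: a dynamic-programming alignment in which every unmatched point pays a fixed gap penalty. The function name `gap_aligned_similarity`, the `gap_penalty` default, and the normalization are all assumptions made for illustration and should not be read as the authors' definition of ECS.

```python
from typing import Sequence


def gap_aligned_similarity(
    pred: Sequence[float],
    truth: Sequence[float],
    gap_penalty: float = 0.5,  # assumed value, must be > 0
) -> float:
    """Illustrative gap-penalized alignment score in [0, 1].

    This is NOT the paper's ECS; it is a hypothetical stand-in that shows
    how a gap penalty separates a shifted series from a truncated one.
    """
    n, m = len(pred), len(truth)
    if n == 0 and m == 0:
        return 1.0

    # Scale value differences by the largest magnitude seen (assumed choice).
    scale = max((abs(v) for v in list(pred) + list(truth)), default=0.0) or 1.0

    # cost[i][j] = minimum cost of aligning pred[:i] with truth[:j].
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        cost[0][j] = j * gap_penalty

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = min(1.0, abs(pred[i - 1] - truth[j - 1]) / scale)
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match the two points
                cost[i - 1][j] + gap_penalty,   # skip a predicted point
                cost[i][j - 1] + gap_penalty,   # skip a ground-truth point
            )

    # Worst case: nothing is matched and every point is a gap.
    worst = gap_penalty * (n + m)
    return 1.0 - cost[n][m] / worst


truth = [0, 1, 4, 9, 4, 1, 0]
identical = gap_aligned_similarity(truth, truth)
shifted = gap_aligned_similarity([0, 0, 1, 4, 9, 4, 1], truth)  # one step late
truncated = gap_aligned_similarity([0, 1, 4], truth)            # stops early
```

In this toy setup the one-step shift costs two gaps, while the truncated series pays a gap for each of the four missing points, so the shifted prediction scores higher than the truncated one. This mirrors the abstract's contrast with Dynamic Time Warping, which has no gap penalty.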