🤖 AI Summary
Existing general-purpose Chart Question Answering (CQA) benchmarks inadequately evaluate the capacity of multimodal large language models (MLLMs) for deep reasoning that integrates visual information with domain-specific knowledge. To address this, we propose DomainCQA, a scalable, systematic methodology for constructing domain-specialized CQA benchmarks, and instantiate it in astronomy via AstroChart, which incorporates expert annotation, chart semantic parsing, and cross-modal alignment. Experiments identify the core bottlenecks of MLLMs as chart-hopping reasoning, joint analysis of multiple charts, and domain-knowledge-guided summarization, rather than mere factual recall. AstroChart establishes the first rigorous, reproducible evaluation standard for domain-specialized MLLMs, advancing multimodal model assessment toward professional application scenarios.
📝 Abstract
Chart Question Answering (CQA) benchmarks are essential for evaluating the capability of Multimodal Large Language Models (MLLMs) to interpret visual data. However, current benchmarks focus primarily on general-purpose CQA and fail to adequately capture domain-specific challenges. We introduce DomainCQA, a systematic methodology for constructing domain-specific CQA benchmarks, and demonstrate its effectiveness by developing AstroChart, a CQA benchmark in the field of astronomy. Our evaluation shows that the primary challenges for existing MLLMs are chart reasoning and combining chart information with domain knowledge for deeper analysis and summarization, rather than domain-specific knowledge itself, highlighting a critical gap in current benchmarks. By providing a scalable and rigorous framework, DomainCQA enables more precise assessment and improvement of MLLMs for domain-specific applications.