🤖 AI Summary
A large-scale, standardized benchmark for evaluating scientific visualization (SciVis) agents is currently lacking, which hinders quantitative assessment of their capabilities and slows progress in the era of multimodal large language models (MLLMs).
Method: The paper adopts an evaluation-centric perspective, systematically analyzing the evaluation requirements and challenges for SciVis agents, and outlines a task-oriented, scalable, and comprehensive evaluation framework. The framework combines multimodal large language models, automated visualization generation, and fine-grained capability decomposition to enable unified assessment of core competencies, including visual understanding, reasoning, generation, and interaction.
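To make the capability-decomposed evaluation concrete, the sketch below shows one way such a benchmark harness could be structured. It is a minimal illustration, not the paper's implementation: the task schema, the capability tags (`understanding`, `reasoning`, `generation`, `interaction`), and the `SciVisTask`/`evaluate` names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical capability tags; the paper's actual decomposition may differ.
CAPABILITIES = {"understanding", "reasoning", "generation", "interaction"}

@dataclass
class SciVisTask:
    """One benchmark item: a natural-language request plus a checkable outcome."""
    task_id: str
    prompt: str                        # e.g. "Render an isosurface of the pressure field"
    dataset: str                       # identifier or path of the scientific dataset
    capabilities: set[str]             # which core competencies this task exercises
    scorer: Callable[[object], float]  # maps the agent's output to a score in [0, 1]

def evaluate(agent: Callable[[str, str], object],
             tasks: list[SciVisTask]) -> dict[str, float]:
    """Run each task through the agent and aggregate scores per capability."""
    per_capability: dict[str, list[float]] = {c: [] for c in CAPABILITIES}
    for task in tasks:
        output = agent(task.prompt, task.dataset)  # image, script, or textual answer
        score = task.scorer(output)
        for cap in task.capabilities:
            per_capability[cap].append(score)
    return {cap: (sum(s) / len(s) if s else 0.0)
            for cap, s in per_capability.items()}

# Example usage with a trivial mock agent and scorer (illustration only):
tasks = [SciVisTask("iso-001", "Render an isosurface of the pressure field",
                    "data/pressure.vti", {"understanding", "generation"},
                    scorer=lambda out: 1.0 if out is not None else 0.0)]
print(evaluate(lambda prompt, dataset: "mock-image", tasks))
```

The per-capability aggregation mirrors the fine-grained decomposition described above: each task contributes only to the competencies it exercises, so a benchmark report can expose separate scores for understanding, reasoning, generation, and interaction rather than a single opaque number.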
Contribution/Results: The paper proposes an evaluation-centric paradigm and demonstrates its feasibility through a proof-of-concept implementation. This lays the foundation for an open, reproducible, community-driven benchmark for SciVis agents, fostering collaborative progress and agent self-improvement in the field.
📝 Abstract
Recent advances in multi-modal large language models (MLLMs) have enabled increasingly sophisticated autonomous visualization agents capable of translating user intentions into data visualizations. However, measuring progress and comparing different agents remains challenging, particularly in scientific visualization (SciVis), due to the absence of comprehensive, large-scale benchmarks for evaluating real-world capabilities. This position paper examines the various types of evaluation required for SciVis agents, outlines the associated challenges, provides a simple proof-of-concept evaluation example, and discusses how evaluation benchmarks can facilitate agent self-improvement. We advocate for a broader collaboration to develop a SciVis agentic evaluation benchmark that would not only assess existing capabilities but also drive innovation and stimulate future development in the field.