SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a systematic, reproducible benchmark for evaluating intelligent agents on multi-step scientific data analysis and visualization tasks. To this end, the authors propose the first structured evaluation framework tailored to scientific visualization, organized along four dimensions (application domain, data type, task complexity, and visualization operation) and comprising 108 diverse, expert-designed cases. A multimodal evaluation pipeline integrates LLM-based scoring, image-based metrics, code checkers, rule validators, and case-specific evaluators, with its consistency validated through expert studies. The study establishes performance baselines for both SciVis-specific and general-purpose coding agents, reveals current capability gaps, and releases an extensible evaluation platform to support ongoing research and diagnostic analysis.
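For concreteness, the sketch below shows one way a benchmark case could be represented along the four taxonomy dimensions named in the summary. This is a minimal, assumed schema for illustration; the field names and example values are not taken from the benchmark itself.

```python
from dataclasses import dataclass, field

# Hypothetical case record keyed by the paper's four taxonomy dimensions.
# All field names and values are illustrative assumptions, not the
# benchmark's actual case format.
@dataclass
class SciVisCase:
    case_id: str
    domain: str                 # application domain, e.g. "medical imaging"
    data_type: str              # e.g. "volumetric", "unstructured mesh"
    complexity: str             # task complexity level, e.g. "multi-step"
    operations: list[str] = field(default_factory=list)  # visualization operations
    prompt: str = ""            # natural-language task given to the agent
    reference_image: str = ""   # path to an expert-produced reference rendering

# Example case (values invented for illustration)
case = SciVisCase(
    case_id="volume-rendering-001",
    domain="medical imaging",
    data_type="volumetric",
    complexity="intermediate",
    operations=["volume rendering", "transfer function design"],
    prompt="Render the CT volume and highlight bone structures.",
    reference_image="refs/volume-rendering-001.png",
)
```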
📝 Abstract
Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis settings. We present SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. Our benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert-crafted cases covering diverse SciVis scenarios. To enable reliable assessment, we introduce a multimodal outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators, including image-based metrics, code checkers, rule-based verifiers, and case-specific evaluators. We also conduct a validity study with 12 SciVis experts to examine the agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general-purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at https://scivisagentbench.github.io/.
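As a rough illustration of the outcome-centric evaluation described in the abstract, the sketch below aggregates an LLM judge with deterministic checks into a single case score. Every function name, stub score, and the equal-weight average is an assumption made for illustration; none of this reflects the benchmark's actual implementation or API.

```python
# Minimal, self-contained sketch of combining an LLM judge with
# deterministic evaluators (image metric, code checker, rule verifier).
# Stub functions return fixed values so the example runs as-is.
from statistics import mean

def llm_judge_score(task_prompt: str, image_path: str) -> float:
    # Stand-in for an LLM judge rating the rendered result in [0, 1].
    return 0.8  # stub value

def image_metric_score(image_path: str, reference_path: str) -> float:
    # Stand-in for an image-based metric (e.g. SSIM) against a reference image.
    return 0.9  # stub value

def code_checker_passes(script_path: str) -> bool:
    # Stand-in for a code checker run on the agent-generated visualization script.
    return True  # stub value

def rule_verifier_passes(output_dir: str) -> bool:
    # Stand-in for a rule-based verifier of case-specific requirements.
    return True  # stub value

def evaluate_case(task_prompt, script_path, image_path, reference_path, output_dir):
    """Combine judge and deterministic scores into one case score in [0, 1]."""
    scores = [
        llm_judge_score(task_prompt, image_path),
        image_metric_score(image_path, reference_path),
        1.0 if code_checker_passes(script_path) else 0.0,
        1.0 if rule_verifier_passes(output_dir) else 0.0,
    ]
    # Equal weights here for simplicity; a real pipeline may weight or gate checks.
    return mean(scores)

print(evaluate_case("Render the isosurface at value 0.5",
                    "agent_script.py", "out.png", "ref.png", "./outputs"))
```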
Problem

Research questions and friction points this paper is trying to address.

scientific visualization
agent evaluation
benchmark
multimodal evaluation
LLM-based agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

scientific visualization
agent benchmark
multimodal evaluation
large language models
structured taxonomy
🔎 Similar Papers
No similar papers found.