SciVisAgentBench: A Benchmark for Evaluating Scientific Data Analysis and Visualization Agents

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of a systematic, reproducible benchmark for evaluating intelligent agents on multi-step scientific data analysis and visualization tasks. To this end, the authors propose the first structured evaluation framework tailored to scientific visualization, organized along four dimensions (application domain, data type, task complexity, and visualization operation) and comprising 108 diverse, expert-designed cases. A multimodal evaluation pipeline integrates LLM-based scoring, image-based metrics, code checkers, rule validators, and case-specific evaluators, with its consistency validated through expert studies. The study establishes performance baselines for both SciVis-specific and general-purpose coding agents, reveals current capability gaps, and releases an extensible evaluation platform to support ongoing research and diagnostic analysis.
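For concreteness, the sketch below shows one way a benchmark case could be represented along the four taxonomy dimensions named in the summary. This is a minimal, assumed schema for illustration; the field names and example values are not taken from the benchmark itself.

```python
from dataclasses import dataclass, field

# Hypothetical case record keyed by the paper's four taxonomy dimensions.
# All field names and values are illustrative assumptions, not the
# benchmark's actual case format.
@dataclass
class SciVisCase:
    case_id: str
    domain: str                 # application domain, e.g. "medical imaging"
    data_type: str              # e.g. "volumetric", "unstructured mesh"
    complexity: str             # task complexity level, e.g. "multi-step"
    operations: list[str] = field(default_factory=list)  # visualization operations
    prompt: str = ""            # natural-language task given to the agent
    reference_image: str = ""   # path to an expert-produced reference rendering

# Example case (values invented for illustration)
case = SciVisCase(
    case_id="volume-rendering-001",
    domain="medical imaging",
    data_type="volumetric",
    complexity="intermediate",
    operations=["volume rendering", "transfer function design"],
    prompt="Render the CT volume and highlight bone structures.",
    reference_image="refs/volume-rendering-001.png",
)
```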
📝 Abstract
Recent advances in large language models (LLMs) have enabled agentic systems that translate natural language intent into executable scientific visualization (SciVis) tasks. Despite rapid progress, the community lacks a principled and reproducible benchmark for evaluating these emerging SciVis agents in realistic, multi-step analysis settings. We present SciVisAgentBench, a comprehensive and extensible benchmark for evaluating scientific data analysis and visualization agents. Our benchmark is grounded in a structured taxonomy spanning four dimensions: application domain, data type, complexity level, and visualization operation. It currently comprises 108 expert-crafted cases covering diverse SciVis scenarios. To enable reliable assessment, we introduce a multimodal outcome-centric evaluation pipeline that combines LLM-based judging with deterministic evaluators, including image-based metrics, code checkers, rule-based verifiers, and case-specific evaluators. We also conduct a validity study with 12 SciVis experts to examine the agreement between human and LLM judges. Using this framework, we evaluate representative SciVis agents and general-purpose coding agents to establish initial baselines and reveal capability gaps. SciVisAgentBench is designed as a living benchmark to support systematic comparison, diagnose failure modes, and drive progress in agentic SciVis. The benchmark is available at https://scivisagentbench.github.io/.
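As a rough illustration of the outcome-centric evaluation described in the abstract, the sketch below aggregates an LLM judge with deterministic checks into a single case score. Every function name, stub score, and the equal-weight average is an assumption made for illustration; none of this reflects the benchmark's actual implementation or API.

```python
# Minimal, self-contained sketch of combining an LLM judge with
# deterministic evaluators (image metric, code checker, rule verifier).
# Stub functions return fixed values so the example runs as-is.
from statistics import mean

def llm_judge_score(task_prompt: str, image_path: str) -> float:
    # Stand-in for an LLM judge rating the rendered result in [0, 1].
    return 0.8  # stub value

def image_metric_score(image_path: str, reference_path: str) -> float:
    # Stand-in for an image-based metric (e.g. SSIM) against a reference image.
    return 0.9  # stub value

def code_checker_passes(script_path: str) -> bool:
    # Stand-in for a code checker run on the agent-generated visualization script.
    return True  # stub value

def rule_verifier_passes(output_dir: str) -> bool:
    # Stand-in for a rule-based verifier of case-specific requirements.
    return True  # stub value

def evaluate_case(task_prompt, script_path, image_path, reference_path, output_dir):
    """Combine judge and deterministic scores into one case score in [0, 1]."""
    scores = [
        llm_judge_score(task_prompt, image_path),
        image_metric_score(image_path, reference_path),
        1.0 if code_checker_passes(script_path) else 0.0,
        1.0 if rule_verifier_passes(output_dir) else 0.0,
    ]
    # Equal weights here for simplicity; a real pipeline may weight or gate checks.
    return mean(scores)

print(evaluate_case("Render the isosurface at value 0.5",
                    "agent_script.py", "out.png", "ref.png", "./outputs"))
```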
Problem

Research questions and friction points this paper is trying to address.

scientific visualization
agent evaluation
benchmark
multimodal evaluation
LLM-based agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

scientific visualization
agent benchmark
multimodal evaluation
large language models
structured taxonomy
🔎 Similar Papers
No similar papers found.