DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for data science agents often suffer from narrow task coverage, insufficient validation of strict data dependencies, and fragmented evaluation interfaces, hindering cross-benchmark comparison and authentic capability assessment. To address these limitations, this work proposes DSGym, presented as the first extensible, dynamic evaluation ecosystem: a standardized, modular end-to-end framework that enables agent training and evaluation within self-contained execution environments. The framework incorporates a high-quality task curation mechanism that eliminates shortcut solutions circumventing genuine data analysis, and it integrates three comprehensive task suites: DSGym-Tasks, DSBio (bioinformatics), and DSPredict (multi-domain prediction). Experiments demonstrate that a 4B-parameter model trained on only 2,000 synthetic examples outperforms GPT-4o on standardized analytical benchmarks, validating the framework's efficacy for efficient training and realistic scientific evaluation.

📝 Abstract
Data science agents promise to accelerate discovery and insight generation by turning data into executable analyses and findings. Yet existing data science benchmarks fall short due to fragmented evaluation interfaces that make cross-benchmark comparison difficult, narrow task coverage, and a lack of rigorous data grounding. In particular, we show that a substantial portion of tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut-solvability filtering. We further expand coverage with (1) DSBio: expert-derived bioinformatics tasks grounded in literature and (2) DSPredict: challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym enables agent training via an execution-verified data synthesis pipeline. As a case study, we build a 2,000-example training set and train a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific contexts.
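The shortcut-solvability filtering the abstract describes can be illustrated with a minimal sketch: a task is dropped if an agent answers it correctly without access to the underlying data, since such a task tests priors rather than genuine analysis. This is only an illustrative assumption of the idea, not DSGym's actual implementation; all function and field names here are hypothetical.

```python
# Hypothetical sketch of shortcut-solvability filtering: attempt each task
# with the data withheld, and exclude tasks that are still answered correctly.
# Field names ("gold_answer", "answer_without_data") are illustrative only.

def answer_without_data(task):
    # Placeholder for querying an agent with the task prompt only,
    # no data files attached. Here we just read a precomputed field.
    return task.get("answer_without_data")

def filter_shortcut_solvable(tasks):
    kept, dropped = [], []
    for task in tasks:
        if answer_without_data(task) == task["gold_answer"]:
            dropped.append(task)   # solvable from the prompt alone: exclude
        else:
            kept.append(task)      # requires the actual data: keep
    return kept, dropped

tasks = [
    {"id": "t1", "gold_answer": "42",   "answer_without_data": "42"},   # shortcut
    {"id": "t2", "gold_answer": "0.73", "answer_without_data": "0.51"}, # grounded
]
kept, dropped = filter_shortcut_solvable(tasks)
```

In practice such a filter would aggregate multiple data-blind attempts per task to reduce noise before excluding it; the single-attempt check above is kept minimal for clarity.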
Problem

Research questions and friction points this paper is trying to address.

data science agents
evaluation benchmarks
data grounding
task coverage
cross-benchmark comparison
Innovation

Methods, ideas, or system contributions that make the work stand out.

DSGym
data science agents
execution-verified training
data grounding
modular benchmarking