🤖 AI Summary
Existing benchmarks for data science agents often suffer from narrow task coverage, insufficient validation of strict data dependencies, and fragmented evaluation interfaces, hindering cross-benchmark comparisons and authentic capability assessment. To address these limitations, this work proposes DSGym—the first extensible, dynamic evaluation ecosystem—featuring a standardized, modular end-to-end framework that enables agent training and evaluation within self-contained execution environments. The framework incorporates a high-quality task curation mechanism to eliminate shortcut solutions that circumvent genuine data analysis and integrates three comprehensive task suites: DSGym-Tasks, DSBio (bioinformatics), and DSPredict (multi-domain prediction). Experiments demonstrate that a 4B-parameter model trained on only 2,000 synthetic examples outperforms GPT-4o on standardized analytical benchmarks, validating the framework’s efficacy in enabling efficient training and realistic scientific evaluation.
📝 Abstract
Data science agents promise to accelerate discovery and insight generation by turning data into executable analyses and findings. Yet existing data science benchmarks fall short: fragmented evaluation interfaces make cross-benchmark comparison difficult, task coverage is narrow, and rigorous data grounding is lacking. In particular, we show that a substantial portion of tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut-solvability filtering. We further expand coverage with (1) DSBio, expert-derived bioinformatics tasks grounded in the literature, and (2) DSPredict, challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym enables agent training via an execution-verified data synthesis pipeline. As a case study, we build a 2,000-example training set and train a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific contexts.