D3-Gym: Constructing Real-World Verifiable Environments for Data-Driven Discovery

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

The lack of verifiable, realistic research environments has held back language models in data-driven scientific discovery. To address this, the work introduces D3-Gym, the first automatically constructed dataset with verifiable environments for scientific agents, comprising 565 tasks derived from 239 real-world research repositories across four disciplines. Each task includes a natural language instruction, an executable environment, input data, reference code, and an automatically generated evaluation script. The auto-generated evaluation scripts reach 87.5% agreement with human-annotated gold standards, which supports their use as a reliable verification signal. Fine-tuning Qwen3-series models on trajectories sampled from D3-Gym yields consistent gains on ScienceAgentBench, with Qwen3-32B improving by 7.8 absolute points and substantially narrowing the gap with strong proprietary models.

📝 Abstract

Despite recent progress in language models and agents for scientific data-driven discovery, further advancing their capabilities is held back by the absence of verifiable environments representing real-world scientific tasks. To fill this gap, we introduce D3-Gym, the first automatically constructed dataset with verifiable environments for scientific Data-Driven Discovery. D3-Gym comprises (1) 565 tasks sourced from 239 real scientific repositories across four disciplines, where (2) each task is equipped with a natural language instruction, an executable environment with pre-installed dependencies, input dataset and artifact previews, a reference code solution, and an automatically synthesized evaluation script. Rigorous evaluation of the quality of the verification signal in D3-Gym confirms that our evaluation scripts achieve 87.5% agreement with human-annotated gold standards and strong alignment in domain-specific evaluation logic, showing their scientific soundness. Further, training on trajectories sampled from D3-Gym yields consistent and substantial gains across Qwen3 models of varying sizes on ScienceAgentBench, boosting Qwen3-32B by 7.8 absolute points and substantially shrinking the gap with strong proprietary models. All D3-Gym artifacts (environments, creation workflow, trajectories, and models) can be found at https://github.com/OSU-NLP-Group/D3-Gym.
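
To make the task structure and the agreement metric concrete, the sketch below models a task as a record with the five components listed in the abstract and computes agreement as the fraction of cases where a script verdict matches a human verdict. The class name, field names, and the function `agreement_rate` are hypothetical and are not taken from the D3-Gym repository; the paper's actual schema and agreement protocol may differ, and the verdicts used are toy values, not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Task:
    """Hypothetical record for one task; field names are assumptions."""
    instruction: str      # natural language instruction
    environment: str      # identifier of the executable environment
    input_data: str       # path to the input dataset and previews
    reference_code: str   # path to the reference solution
    eval_script: str      # path to the synthesized evaluation script


def agreement_rate(script_verdicts: Sequence[bool],
                   human_verdicts: Sequence[bool]) -> float:
    """Fraction of cases where the script and human verdicts match."""
    if len(script_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not script_verdicts:
        raise ValueError("at least one verdict pair is required")
    matches = sum(s == h for s, h in zip(script_verdicts, human_verdicts))
    return matches / len(script_verdicts)


# Toy verdicts for illustration only: three of the four pairs agree.
toy_script = [True, False, True, True]
toy_human = [True, False, False, True]
print(agreement_rate(toy_script, toy_human))
```

Under this simple definition, the reported 87.5% would correspond to the script and the human annotation matching on seven of every eight compared cases.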

Problem

Research questions and friction points this paper is trying to address.

data-driven discovery

verifiable environments

scientific tasks

language models

agent evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

verifiable environments

data-driven discovery

scientific agent benchmarking