MedAgentGym: Training LLM Agents for Code-Based Medical Reasoning at Scale

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit limited capability in code-driven medical reasoning, a critical yet underexplored domain requiring executable, interpretable, and clinically grounded inference. Method: We introduce MedAgentGym—the first LLM agent training environment designed specifically for code-based medical reasoning—comprising 72,413 real-world biomedical task instances, an executable sandbox, interactive feedback mechanisms, and verifiable ground-truth annotations. We propose a scalable training framework for medical code reasoning within a unified executable environment, combining supervised fine-tuning (SFT) with proximal policy optimization (PPO)-based reinforcement learning. Contribution/Results: Our study is the first to empirically reveal substantial performance gaps between commercial and open-source LLMs on such tasks. We demonstrate that a low-cost 7B-parameter model, Med-Copilot-7B, reaches GPT-4o-level accuracy after SFT (+36.44% absolute gain) and further RL refinement (+42.47%), while remaining an affordable, privacy-preserving coding assistant for clinical research.
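The executable sandbox with interactive feedback and verifiable ground truth described above can be pictured as a simple agent-environment loop. This is a minimal illustrative sketch, not the paper's implementation: the class, method names, and the toy task are all hypothetical.

```python
import os
import subprocess
import sys
import tempfile

class CodingTaskEnv:
    """Hypothetical sketch of an executable coding-task environment:
    the agent submits code, the environment runs it in a subprocess
    and returns stdout/stderr as interactive feedback, plus a reward
    computed against a verifiable ground-truth annotation."""

    def __init__(self, task_description: str, expected_output: str):
        self.task_description = task_description
        self.expected_output = expected_output

    def step(self, agent_code: str):
        # Write the agent's code to a temp file and execute it sandboxed
        # behind a timeout (a real sandbox would add far stricter isolation).
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(agent_code)
            path = f.name
        try:
            run = subprocess.run(
                [sys.executable, path],
                capture_output=True, text=True, timeout=10,
            )
        finally:
            os.unlink(path)
        feedback = run.stdout + run.stderr          # feedback shown to the agent
        solved = run.stdout.strip() == self.expected_output
        return feedback, (1.0 if solved else 0.0)   # verifiable reward signal

# Usage: a toy task with a checkable ground truth.
env = CodingTaskEnv("Compute 2 + 2 and print the result.", "4")
feedback, reward = env.step("print(2 + 2)")
```

The verifiable reward is what makes trajectories usable for training: successful rollouts can be filtered into SFT data, and the same scalar drives the RL stage.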

📝 Abstract
We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and open-source counterparts. Leveraging MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGYM delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.
Problem

Research questions and friction points this paper is trying to address.

Enhancing coding-based medical reasoning in LLM agents
Addressing performance gap between commercial and open-source LLMs
Providing scalable training for biomedical research assistants
Innovation

Methods, ideas, or system contributions that make the work stand out.

First public training environment for code-based medical reasoning
72,413 tasks from real biomedical scenarios
Combines supervised and reinforcement learning
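The two-stage recipe in the last bullet (SFT on successful trajectories, then RL refinement) can be sketched as a dependency-free toy. Everything here is illustrative: the "policy" is just logits over two candidate code snippets for one task, and a plain REINFORCE update with a baseline stands in for the paper's PPO stage.

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy policy: logits over two candidate snippets; snippet 0 is correct.
logits = [0.0, 0.0]
lr = 0.5

def reward(action):
    return 1.0 if action == 0 else 0.0

# Stage 1: SFT — behaviour-clone demonstrations that earned reward 1
# (cross-entropy gradient steps toward the demonstrated action).
for demo in [0] * 20:
    probs = softmax(logits)
    for i in range(len(logits)):
        target = 1.0 if i == demo else 0.0
        logits[i] += lr * (target - probs[i])

# Stage 2: RL refinement — REINFORCE with a moving-average baseline
# (standing in for PPO to keep the sketch self-contained).
baseline = 0.0
for _ in range(200):
    probs = softmax(logits)
    action = random.choices([0, 1], weights=probs)[0]
    r = reward(action)
    baseline = 0.9 * baseline + 0.1 * r
    advantage = r - baseline
    for i in range(len(logits)):
        grad = (1.0 if i == action else 0.0) - probs[i]
        logits[i] += lr * advantage * grad

print(softmax(logits)[0])  # probability assigned to the correct snippet
```

The SFT stage bootstraps the policy from verified successes; the RL stage then refines it using the same executable reward signal, which mirrors the +36.44% (SFT) and +42.47% (SFT + RL) gains reported above in structure, though of course not in scale.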