🤖 AI Summary
Problem: Large language models (LLMs) show limited capability in code-based medical reasoning, a critical yet underexplored domain that requires executable, interpretable, and clinically grounded inference.
Method: We introduce MedAgentGYM, the first publicly available agent training environment designed for code-based medical reasoning. It comprises 72,413 task instances across 129 categories drawn from real-world biomedical scenarios, each packaged in an executable sandbox with interactive feedback and verifiable ground-truth annotations. Within this unified executable environment, we propose a scalable training framework that combines supervised fine-tuning (SFT) with continued reinforcement learning (RL).
Contribution/Results: Benchmarking more than 30 LLMs reveals a notable performance gap between commercial API-based models and open-source models on these tasks. A low-cost 7B-parameter model, Med-Copilot-7B, becomes competitive with GPT-4o after SFT (+36.44%) and continued RL (+42.47%), offering an affordable, privacy-preserving coding assistant for biomedical research and practice.
📝 Abstract
We introduce MedAgentGYM, the first publicly available training environment designed to enhance coding-based medical reasoning capabilities in large language model (LLM) agents. MedAgentGYM comprises 72,413 task instances across 129 categories derived from authentic real-world biomedical scenarios. Tasks are encapsulated within executable coding environments, each featuring detailed task descriptions, interactive feedback mechanisms, verifiable ground-truth annotations, and scalable training trajectory generation. Extensive benchmarking of over 30 LLMs reveals a notable performance disparity between commercial API-based models and open-source counterparts. Leveraging MedAgentGYM, Med-Copilot-7B achieves substantial performance gains through supervised fine-tuning (+36.44%) and continued reinforcement learning (+42.47%), emerging as an affordable and privacy-preserving alternative competitive with gpt-4o. By offering both a comprehensive benchmark and accessible, expandable training resources within unified execution environments, MedAgentGYM delivers an integrated platform to develop LLM-based coding assistants for advanced biomedical research and practice.
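To make the idea of an executable task environment concrete, the following is a minimal sketch of how a task with a description, execution feedback, and a verifiable ground truth might be wrapped in a gym-style loop. It is not the MedAgentGYM API: `CodeTask`, `ToyCodeEnv`, `run_episode`, the `answer` variable convention, and the numeric tolerance are all hypothetical names and choices made for illustration, and plain `exec` stands in for a real isolated sandbox.

```python
import contextlib
import io
from dataclasses import dataclass


@dataclass
class CodeTask:
    """Hypothetical task record: a description plus a verifiable numeric answer."""
    description: str
    ground_truth: float


class ToyCodeEnv:
    """Hypothetical stand-in for an executable coding environment."""

    def __init__(self, task: CodeTask, max_turns: int = 3):
        self.task = task
        self.max_turns = max_turns
        self.turns = 0

    def reset(self) -> str:
        """Start a new episode and return the task description."""
        self.turns = 0
        return self.task.description

    def step(self, code: str):
        """Run submitted code and return (feedback, reward, done)."""
        self.turns += 1
        scope = {}
        stdout = io.StringIO()
        try:
            # Assumption: exec is NOT a sandbox; a real system would isolate this.
            with contextlib.redirect_stdout(stdout):
                exec(code, scope)
        except Exception as exc:
            # Interactive feedback: surface the error so the agent can retry.
            feedback = f"{type(exc).__name__}: {exc}"
            return feedback, 0.0, self.turns >= self.max_turns
        # Assumed convention: the program stores its result in `answer`.
        answer = scope.get("answer")
        correct = (
            isinstance(answer, (int, float))
            and abs(answer - self.task.ground_truth) < 1e-6
        )
        if correct:
            return "correct", 1.0, True
        feedback = f"answer={answer!r} does not match; stdout={stdout.getvalue()!r}"
        return feedback, 0.0, self.turns >= self.max_turns


def run_episode(env: ToyCodeEnv, attempts):
    """Replay candidate programs and record (code, feedback, reward) steps."""
    env.reset()
    trajectory = []
    for code in attempts:
        feedback, reward, done = env.step(code)
        trajectory.append((code, feedback, reward))
        if done:
            break
    return trajectory
```

In this sketch, a failed first attempt produces an error message as feedback and a corrected second attempt earns the reward; the recorded trajectory is the kind of data that fine-tuning or reinforcement learning could consume, under the assumptions stated above.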