CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents

📅 2026-05-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses two limitations of existing clinical reasoning agents: manually curated tool libraries are costly to maintain, and zero-shot code generation often yields inefficient or unreliable reasoning chains, especially under institution-specific clinical policies. The authors introduce CodeClinic, a benchmark built on MIMIC-IV that covers longitudinal ICU surveillance and compositional information seeking, to evaluate whether LLM agents can synthesize and compose reusable clinical skills instead of relying on fixed toolboxes. They also propose an offline autoformalization pipeline that converts natural-language clinical guidelines into verified Python skill libraries through iterative LLM refinement. Compared with zero-shot code generation, the resulting libraries improve consistency while reducing per-query token usage by up to 40%.
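The generate, verify, and refine idea behind an offline pipeline of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `verify`, `refine`, `fake_llm`, and `TEST_CASES` are hypothetical names, the model call is replaced by a stub, and the only clinical content is the standard mean arterial pressure approximation (diastolic plus one third of the pulse pressure).

```python
from typing import Callable, Optional

# Hypothetical verification cases: (systolic, diastolic) -> expected MAP in mmHg.
TEST_CASES = [((120.0, 80.0), 93.33), ((90.0, 60.0), 70.0)]


def verify(source: str) -> Optional[str]:
    """Run candidate skill source against the test cases.

    Returns an error message to feed back to the generator, or None on success.
    """
    namespace: dict = {}
    try:
        # Plain exec is fine for a sketch; a real pipeline would sandbox this.
        exec(source, namespace)
        skill = namespace["mean_arterial_pressure"]
        for args, expected in TEST_CASES:
            got = skill(*args)
            if abs(got - expected) > 0.01:
                return f"{args}: expected {expected}, got {got:.2f}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def refine(
    generate: Callable[[str, Optional[str]], str],
    guideline: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask the generator for code until it verifies or the round budget runs out."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        candidate = generate(guideline, feedback)
        feedback = verify(candidate)
        if feedback is None:
            return candidate  # verified skill source, ready to store in a library
    return None


def fake_llm(guideline: str, feedback: Optional[str]) -> str:
    """Stand-in for a model call: the first draft is wrong, the revision is right."""
    if feedback is None:
        return "def mean_arterial_pressure(sbp, dbp):\n    return (sbp + dbp) / 2\n"
    return "def mean_arterial_pressure(sbp, dbp):\n    return dbp + (sbp - dbp) / 3\n"
```

Calling `refine(fake_llm, "...")` rejects the first draft on the test cases and accepts the revision. Because verification happens once, offline, the stored skill can be reused at query time without regenerating code.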
📝 Abstract
Clinical reasoning agents based on large language models (LLMs) aim to automate tasks such as intensive care unit (ICU) monitoring and patient state tracking from electronic health records (EHRs). Existing systems typically rely on manually curated clinical tools or skills for concepts such as sepsis detection and organ failure assessment. However, maintaining these tool libraries requires substantial expert effort, while zero-shot querying or code generation often produces inefficient and unreliable reasoning chains, especially under institution-specific clinical policies. We introduce CodeClinic, a benchmark built on MIMIC-IV for evaluating whether LLM agents can synthesize and compose reusable clinical skills instead of relying on fixed toolboxes. The benchmark contains two complementary tasks: longitudinal ICU surveillance and compositional information seeking. The longitudinal setting simulates monitoring patient trajectories with structured decisions every four hours across 25 findings and eight clinical families, while the compositional setting spans 63k instances across 259 tasks in nine domains and is stratified by compositional dependency depth to evaluate increasingly complex multi-step reasoning. We further propose an offline autoformalization pipeline that converts natural-language clinical guidelines into reusable and verified Python skill libraries through iterative LLM refinement. Compared with zero-shot code generation, the resulting libraries improve consistency while reducing per-query token usage by up to 40%.
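As a rough illustration of what composing reusable skills and making a structured decision every four hours might look like, the sketch below registers two skills, one built on the other, and evaluates the composed skill over consecutive four-hour windows. Everything here is an assumption made for illustration: the registry, the `depth` definition, the record shape, and the 65 mmHg threshold are placeholders rather than the benchmark's actual schema, findings, or clinical criteria.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Tuple

# Hypothetical record shape: (timestamp, systolic mmHg, diastolic mmHg).
Reading = Tuple[datetime, float, float]

SKILLS: Dict[str, Callable] = {}
DEPENDS_ON: Dict[str, List[str]] = {}


def skill(name: str, deps: Tuple[str, ...] = ()):
    """Register a function as a named skill together with the skills it builds on."""
    def register(fn: Callable) -> Callable:
        SKILLS[name] = fn
        DEPENDS_ON[name] = list(deps)
        return fn
    return register


@skill("map")
def mean_arterial_pressure(sbp: float, dbp: float) -> float:
    # Standard approximation: diastolic plus one third of the pulse pressure.
    return dbp + (sbp - dbp) / 3


@skill("low_map", deps=("map",))
def low_map(readings: List[Reading], threshold: float = 65.0) -> bool:
    """True if any reading in the window has MAP below the placeholder threshold."""
    return any(SKILLS["map"](s, d) < threshold for _, s, d in readings)


def depth(name: str) -> int:
    """Dependency depth of a skill: 0 for a leaf, else 1 + the deepest dependency."""
    deps = DEPENDS_ON[name]
    return 0 if not deps else 1 + max(depth(d) for d in deps)


def surveil(
    readings: List[Reading],
    start: datetime,
    end: datetime,
    step: timedelta = timedelta(hours=4),
) -> List[Tuple[datetime, bool]]:
    """Emit one decision per window of length `step` over [start, end)."""
    decisions = []
    t = start
    while t < end:
        window = [r for r in readings if t <= r[0] < t + step]
        decisions.append((t, low_map(window)))
        t += step
    return decisions
```

Under these assumptions, `depth("low_map")` is 1 because it calls one leaf skill. A benchmark stratified by dependency depth would include tasks whose skills chain through several such layers.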
Problem

Research questions and friction points this paper is trying to address.

clinical reasoning agents
code generation
large language models
electronic health records
automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

CodeClinic
clinical reasoning agents
autoformalization
compositional reasoning
LLM-based skill synthesis