🤖 AI Summary
Problem: Existing electronic medical record (EMR) data extraction relies heavily on manually crafted, database-specific rules, resulting in poor scalability and limited cross-institutional reusability.
Method: We propose the first large language model (LLM)-based agent framework for EMR analytics, one that uses SQL as the agent’s medium for both perception and decision making. The approach combines dynamic schema understanding, documentation-driven reasoning, and multi-turn interactive query refinement, supporting end-to-end automated cohort construction and feature extraction without hard-coded rules.
Contribution/Results: The framework generalizes across both seen and unseen schemas. Evaluated on three real-world EMR databases (MIMIC-III, eICU, and SICdb), it achieves high feature extraction accuracy, significant cross-database transferability, orders-of-magnitude improvement in data preparation efficiency, and substantially enhanced reproducibility. The code is slated for public release and is accompanied by an online demonstration system.
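As an illustration of what "SQL as a perception medium" could look like in practice, the sketch below lets a program discover an unfamiliar schema purely by issuing queries. It is a minimal sketch under stated assumptions, not the EMR-AGENT implementation: it uses Python's built-in sqlite3 module with an in-memory toy database, and the table names, column names, and helper functions (`observe_schema`, `sample_rows`) are hypothetical.

```python
import sqlite3


def observe_schema(conn):
    """Return {table: [(column, declared_type), ...]} using SQL introspection only."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(c[1], c[2]) for c in cols]
    return schema


def sample_rows(conn, table, limit=3):
    """Peek at a few rows so that a downstream reasoner can see actual values."""
    return conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,)).fetchall()


# Toy stand-in for an EMR database; all names and values are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, age INTEGER);
CREATE TABLE d_items (itemid INTEGER PRIMARY KEY, label TEXT);
INSERT INTO patients VALUES (1, 64), (2, 17);
INSERT INTO d_items VALUES (1, 'HR (bpm)'), (2, 'Resp rate');
""")

schema = observe_schema(conn)
for table, columns in schema.items():
    print(table, columns, sample_rows(conn, table, limit=1))
```

In an agent setting, the returned schema and sampled rows would be passed to the language model as context for deciding the next query. How EMR-AGENT actually formats these observations is not specified in this summary.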
📝 Abstract
Machine learning models for clinical prediction rely on structured data extracted from Electronic Medical Records (EMRs), yet this process remains dominated by hardcoded, database-specific pipelines for cohort definition, feature selection, and code mapping. These manual efforts limit scalability, reproducibility, and cross-institutional generalization. To address this, we introduce EMR-AGENT (Automated Generalized Extraction and Navigation Tool), an agent-based framework that replaces manual rule writing with dynamic, language model-driven interaction to extract and standardize structured clinical data. Our framework automates cohort selection, feature extraction, and code mapping through interactive querying of databases. Our modular agents iteratively observe query results and reason over schema and documentation, using SQL not just for data retrieval but also as a tool for database observation and decision making. This eliminates the need for hand-crafted, schema-specific logic. To enable rigorous evaluation, we develop a benchmarking codebase for three EMR databases (MIMIC-III, eICU, SICdb), including both seen and unseen schema settings. Our results demonstrate strong performance and generalization across these databases, highlighting the feasibility of automating a process previously thought to require expert-driven design. The code will be released publicly at https://github.com/AITRICS/EMR-AGENT/tree/main. For a demonstration, please visit our anonymous demo page: https://anonymoususer-max600.github.io/EMR_AGENT/
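The observe-and-reason loop described in the abstract can be pictured with a small sketch of iterative code mapping. It is hedged throughout: `propose_query` is a hypothetical rule-based stand-in for a language model call, the `d_items` dictionary table and its contents are invented for illustration, and sqlite3 is used only so that the example is self-contained. None of this is taken from the EMR-AGENT codebase.

```python
import sqlite3


def propose_query(feature, history):
    """Stand-in for a language model: choose the next SQL probe from past observations."""
    if len(history) == 0:
        # First attempt: exact label match.
        return ("SELECT itemid, label FROM d_items WHERE lower(label) = lower(?)",
                (feature,))
    if len(history) == 1:
        # The exact match returned nothing, so broaden to a substring match.
        return ("SELECT itemid, label FROM d_items WHERE lower(label) LIKE lower(?)",
                (f"%{feature}%",))
    return None  # No further strategies in this toy policy.


def map_feature(conn, feature, max_turns=3):
    """Query, observe the result, and refine until rows are found or the turns run out."""
    history = []
    for _ in range(max_turns):
        proposal = propose_query(feature, history)
        if proposal is None:
            break
        sql, params = proposal
        rows = conn.execute(sql, params).fetchall()
        history.append((sql, rows))  # The observation feeds the next decision.
        if rows:
            return [r[0] for r in rows], history
    return [], history


# Toy item dictionary; the IDs and labels are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE d_items (itemid INTEGER PRIMARY KEY, label TEXT);
INSERT INTO d_items VALUES
  (1, 'HR (bpm)'), (2, 'Heart rate, monitor'), (3, 'Resp rate');
""")

item_ids, trace = map_feature(conn, "heart rate")
print(item_ids, len(trace))
```

Here the first probe returns no rows, and that empty result is itself the observation that triggers the broader second probe. A real agent would replace the fixed fallback rules with model-generated SQL conditioned on the schema, the documentation, and the accumulated query history.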