🤖 AI Summary
Problem: Extracting clinically meaningful features from unstructured clinical notes in electronic health records (EHRs) remains challenging: manual curation is labor-intensive, and automated methods lack interpretability and clinical relevance. Method: We propose SNOW (Scalable Note-to-Outcome Workflow), a modular multi-agent system powered by large language models (LLMs) that generates interpretable structured features from clinical notes end to end, without human intervention. SNOW orchestrates specialized agents for feature discovery, extraction, validation, post-processing, and aggregation. Contribution/Results: To our knowledge, SNOW is the first multi-agent framework applied to clinical feature engineering. Evaluated on 5-year prostate cancer recurrence prediction in 147 patients, SNOW achieves an AUC-ROC of 0.761, close to manually curated clinician features (0.771) and above both baseline features alone (0.691) and all representational feature generation approaches. These results suggest that autonomous LLM systems can approach expert-level feature engineering while keeping the resulting features interpretable.
📝 Abstract
Electronic health records (EHRs) contain rich unstructured clinical notes that could enhance predictive modeling, yet extracting meaningful features from these notes remains challenging. Current approaches range from labor-intensive manual clinician feature generation (CFG) to fully automated representational feature generation (RFG), which lacks interpretability and clinical relevance. Here we introduce SNOW (Scalable Note-to-Outcome Workflow), a modular multi-agent system powered by large language models (LLMs) that autonomously generates structured clinical features from unstructured notes without human intervention. We evaluated SNOW against manual CFG, clinician-guided LLM approaches, and RFG methods for predicting 5-year prostate cancer recurrence in 147 patients from Stanford Healthcare. Manual CFG achieved the highest performance (AUC-ROC: 0.771), and SNOW came close to it (0.761) without requiring any clinical expertise, significantly outperforming both baseline features alone (0.691) and all RFG approaches. The clinician-guided LLM method also performed well (0.732) but still required expert input. SNOW's specialized agents handle feature discovery, extraction, validation, post-processing, and aggregation, creating interpretable features that capture complex clinical information typically accessible only through manual review. Our findings demonstrate that autonomous LLM systems can approach expert-level feature engineering at scale, potentially transforming how clinical ML models leverage unstructured EHR data while maintaining the interpretability essential for clinical deployment.
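The five agent roles named in the abstract (discovery, extraction, validation, post-processing, aggregation) can be pictured as stages of a pipeline. The sketch below is a minimal illustration of that idea and not SNOW's implementation: every function name, prompt, and the `perineural_invasion` feature are hypothetical, features are assumed to be binary yes/no, and the LLM is replaced by a keyword-matching stub so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch: an "LLM" is any callable mapping a prompt to a reply.
LLM = Callable[[str], str]


@dataclass
class FeatureSpec:
    name: str         # illustrative feature name, e.g. "perineural_invasion"
    description: str  # plain-language definition of the feature


def discover(llm: LLM, sample_notes: List[str]) -> List[FeatureSpec]:
    """Discovery: ask the model to propose features as 'name: description' lines."""
    reply = llm("Propose features for these notes:\n" + "\n".join(sample_notes))
    specs = []
    for line in reply.splitlines():
        if ":" in line:
            name, desc = line.split(":", 1)
            specs.append(FeatureSpec(name.strip(), desc.strip()))
    return specs


def extract(llm: LLM, spec: FeatureSpec, note: str) -> str:
    """Extraction: ask for one feature's value in one note."""
    return llm(f"Extract '{spec.name}' ({spec.description}) from:\n{note}").strip()


def validate(raw: str) -> bool:
    """Validation (assumption: binary features): accept only yes/no answers."""
    return raw.lower() in {"yes", "no"}


def postprocess(raw: str) -> int:
    """Post-processing: encode a validated answer as 0/1."""
    return 1 if raw.lower() == "yes" else 0


def aggregate(values: List[int]) -> int:
    """Aggregation (assumption): a patient is positive if any note is positive."""
    return max(values) if values else 0


def run_pipeline(llm: LLM, notes_by_patient: Dict[str, List[str]]) -> Dict[str, Dict[str, int]]:
    """Run all five stages and return one feature row per patient."""
    sample = [n for notes in notes_by_patient.values() for n in notes][:5]
    specs = discover(llm, sample)
    table: Dict[str, Dict[str, int]] = {}
    for patient_id, notes in notes_by_patient.items():
        row = {}
        for spec in specs:
            raws = [extract(llm, spec, note) for note in notes]
            row[spec.name] = aggregate([postprocess(r) for r in raws if validate(r)])
        table[patient_id] = row
    return table


def fake_llm(prompt: str) -> str:
    """Keyword-matching stand-in for a real model, used only for this demo."""
    if prompt.startswith("Propose"):
        return "perineural_invasion: whether perineural invasion is reported"
    return "yes" if "perineural invasion present" in prompt.lower() else "no"


notes = {
    "p1": ["Perineural invasion present.", "Follow-up visit."],
    "p2": ["No perineural invasion."],
}
table = run_pipeline(fake_llm, notes)
```

Keeping each stage as a separate function mirrors the modular design the abstract describes: any stage can be swapped for a real LLM-backed agent without changing the others, and the resulting table of named features stays human-readable.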