🤖 AI Summary
This work addresses the challenges of feature engineering in electronic health records (EHRs), which are characterized by irregular observations, variable measurement frequencies, and sparse structures. Existing approaches often lack clinical contextual understanding or rely on uniformly structured data. To overcome these limitations, the authors propose a privacy-preserving, tool-augmented large language model (LLM) framework that generates clinically meaningful, executable feature extraction code solely from EHR schema information and task descriptions. The method incorporates specialized temporal query routines to handle irregularly sampled data and employs an iterative validation loop to support both univariate and multivariate feature generation. Evaluated on eight clinical prediction tasks across four ICU datasets, the approach achieves the highest mean AUROC on seven of the eight tasks, with improvements of up to 6 percentage points over strong baselines.
📝 Abstract
Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.
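To make the idea of feature-extraction code over irregular time series concrete, the sketch below shows the kind of function such a pipeline might emit. It is an illustration only and is not taken from the FeatEHR-LLM repository: the `Observation` record, the `query_window` helper, the `lactate_features` function, and the 24-hour horizon are all hypothetical names and choices made for this example. The point is that the features depend on when measurements happened (count, recency, slope over actual elapsed time) rather than on a resampled regular grid.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Observation:
    """One irregularly timed measurement (hypothetical schema)."""
    hours: float      # time since ICU admission, in hours
    variable: str     # measurement name, e.g. "lactate"
    value: float


def query_window(
    observations: Iterable[Observation],
    variable: str,
    start: float,
    end: float,
) -> List[Tuple[float, float]]:
    """Hypothetical temporal query tool: (time, value) pairs for one
    variable with start <= time <= end, sorted by time."""
    return sorted(
        (o.hours, o.value)
        for o in observations
        if o.variable == variable and start <= o.hours <= end
    )


def lactate_features(
    observations: Iterable[Observation],
    horizon: float = 24.0,
) -> Dict[str, Optional[float]]:
    """Univariate features for lactate over the first `horizon` hours."""
    obs = query_window(observations, "lactate", 0.0, horizon)
    if not obs:
        # Absence of a measurement is itself informative, so keep the
        # zero count and mark the remaining features as missing.
        return {
            "lactate_count": 0,
            "lactate_last": None,
            "lactate_hours_since_last": None,
            "lactate_slope": None,
        }
    first_time, first_value = obs[0]
    last_time, last_value = obs[-1]
    span = last_time - first_time
    # Slope uses the real elapsed time between first and last samples.
    slope = (last_value - first_value) / span if span > 0 else None
    return {
        "lactate_count": len(obs),
        "lactate_last": last_value,
        "lactate_hours_since_last": horizon - last_time,
        "lactate_slope": slope,
    }


patient = [
    Observation(1.0, "lactate", 2.0),
    Observation(3.0, "heart_rate", 90.0),
    Observation(5.0, "lactate", 4.0),
    Observation(30.0, "lactate", 1.0),  # outside the 24-hour window
]
features = lactate_features(patient)
```

For the toy patient above, only the two lactate values inside the first 24 hours contribute. In a schema-only setting like the one the abstract describes, a model would see column names and the task description, while code of this kind would run locally on the actual records.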