ACES: Automatic Cohort Extraction System for Event-Stream Datasets

📅 2024-06-28
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Medical machine learning suffers from poor reproducibility in task definition and cohort construction, because electronic health record (EHR) datasets, model pipelines, and cohort definitions are often kept private. Method: We propose an automated cohort extraction system for event-stream EHR data, introducing a domain-specific configuration language (DSL) designed specifically for event streams. This DSL decouples dataset-agnostic inclusion/exclusion criteria from dataset-specific clinical concepts. The system supports zero-code adaptation to standardized EHR formats (e.g., MEDS, ESGPT) via a modular pipeline integrating DSL compilation, event-stream parsing, rule-driven cohort generation, and standardized interface adapters. Contribution/Results: Experiments across multi-institutional real-world EHR datasets demonstrate consistent cohort definitions across formats and institutions. Our approach lowers the barrier to task specification, enables concept-level reproducibility across datasets, supports exact cohort replication within a dataset, and improves reproducibility and collaborative efficiency in EHR-based research.

📝 Abstract
Reproducibility remains a significant challenge in machine learning (ML) for healthcare. Datasets, model pipelines, and even task or cohort definitions are often private in this field, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. We address a significant part of this problem by introducing the Automatic Cohort Extraction System (ACES) for event-stream data. This library is designed to simultaneously simplify the development of tasks and cohorts for ML in healthcare and also enable their reproduction, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides: (1) a highly intuitive and expressive domain-specific configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion or exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or Event Stream GPT (ESGPT) formats, or to *any* dataset in which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks in representation learning, redefine the way researchers interact with EHR datasets, and significantly improve the state of reproducibility for ML studies using this modality. ACES is available at: https://github.com/justin13601/aces.
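The split the abstract describes, between dataset-specific concepts (predicates) and dataset-agnostic inclusion or exclusion criteria, can be illustrated with a small Python sketch. This is not the ACES API or its configuration language; the names `Event`, `Criteria`, `PREDICATES`, and `extract_cohort`, and the raw codes such as `"ADMIT"`, are hypothetical and chosen only for illustration. The sketch assumes a cohort is defined by a trigger event plus required and forbidden events in a fixed window after it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    patient_id: int
    time: datetime
    code: str  # raw, dataset-specific code


# Dataset-specific part (hypothetical codes): concept name -> raw code.
PREDICATES = {"admission": "ADMIT", "discharge": "DISCH", "death": "DEATH"}


@dataclass(frozen=True)
class Criteria:
    """Dataset-agnostic part: refers only to concept names, never raw codes."""
    trigger: str
    window: timedelta
    must_have: tuple
    must_not_have: tuple


def extract_cohort(events, predicates, criteria):
    """Return sorted (patient_id, trigger_time) pairs that satisfy the criteria."""
    by_patient = {}
    for e in events:
        by_patient.setdefault(e.patient_id, []).append(e)

    cohort = []
    for pid, evs in by_patient.items():
        for trig in evs:
            if trig.code != predicates[criteria.trigger]:
                continue
            end = trig.time + criteria.window
            # Codes observed strictly after the trigger, up to the window end.
            in_window = {e.code for e in evs if trig.time < e.time <= end}
            included = all(predicates[p] in in_window for p in criteria.must_have)
            excluded = any(predicates[p] in in_window for p in criteria.must_not_have)
            if included and not excluded:
                cohort.append((pid, trig.time))
    return sorted(cohort)


T0 = datetime(2024, 1, 1)
EVENTS = [
    Event(1, T0, "ADMIT"), Event(1, T0 + timedelta(days=2), "DISCH"),
    Event(2, T0, "ADMIT"), Event(2, T0 + timedelta(days=1), "DEATH"),
    Event(3, T0, "ADMIT"),  # no discharge recorded within the window
]
CRITERIA = Criteria(
    trigger="admission",
    window=timedelta(days=7),
    must_have=("discharge",),
    must_not_have=("death",),
)
COHORT = extract_cohort(EVENTS, PREDICATES, CRITERIA)
print(COHORT)
```

In this toy data only patient 1 qualifies: patient 2 dies inside the window and patient 3 has no discharge. Porting the cohort to another dataset would, under these assumptions, mean changing only the `PREDICATES` mapping.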

Problem

Research questions and friction points this paper is trying to address.

Addresses reproducibility challenges in healthcare ML.
Simplifies task and cohort development for ML in healthcare.
Enables automatic extraction of patient records from event-stream data.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Intuitive domain-specific configuration language
Pipeline for automatic patient record extraction
Compatibility with MEDS and ESGPT formats
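To make the idea of conceptual reproducibility across datasets more concrete, here is a minimal sketch in which the same criteria are applied to two datasets that use different code vocabularies. It is an illustration under stated assumptions, not ACES itself: `count_predicates`, `select`, the two mappings, and all raw codes are hypothetical, and the criteria are simplified to per-patient minimum and maximum counts of each concept.

```python
from collections import Counter


def count_predicates(rows, predicate_map):
    """Count concept occurrences per patient.

    rows: iterable of (patient_id, raw_code).
    predicate_map: concept name -> set of raw codes (dataset-specific).
    Returns {patient_id: Counter(concept -> count)}; patients with no
    matching codes are omitted.
    """
    code_to_concept = {
        code: concept
        for concept, codes in predicate_map.items()
        for code in codes
    }
    counts = {}
    for pid, code in rows:
        concept = code_to_concept.get(code)
        if concept is not None:
            counts.setdefault(pid, Counter())[concept] += 1
    return counts


def select(counts, min_counts, max_counts):
    """Dataset-agnostic criteria: keep patients within the count bounds."""
    return sorted(
        pid
        for pid, c in counts.items()
        if all(c[k] >= v for k, v in min_counts.items())
        and all(c[k] <= v for k, v in max_counts.items())
    )


# Two hypothetical datasets with different raw vocabularies.
MAP_A = {"admission": {"ADMIT"}, "death": {"DEATH"}}
MAP_B = {"admission": {"ENC_INPATIENT_START", "ENC_ER_ADMIT"}, "death": {"DECEASED"}}
ROWS_A = [(1, "ADMIT"), (1, "ADMIT"), (2, "ADMIT"), (2, "DEATH"), (3, "LAB")]
ROWS_B = [
    (10, "ENC_INPATIENT_START"), (10, "ENC_ER_ADMIT"),
    (11, "ENC_INPATIENT_START"), (11, "ENC_INPATIENT_START"), (11, "DECEASED"),
]

# Shared criteria: at least two admissions and no recorded death.
MIN_COUNTS = {"admission": 2}
MAX_COUNTS = {"death": 0}

COHORT_A = select(count_predicates(ROWS_A, MAP_A), MIN_COUNTS, MAX_COUNTS)
COHORT_B = select(count_predicates(ROWS_B, MAP_B), MIN_COUNTS, MAX_COUNTS)
print(COHORT_A, COHORT_B)
```

The criteria objects (`MIN_COUNTS`, `MAX_COUNTS`) are identical for both datasets; only the predicate mapping differs, which is the kind of separation the configuration language is described as providing.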

Authors

Justin Xu
University of Oxford
J. Gallifant
Massachusetts Institute of Technology
Alistair E. W. Johnson
Independent Scientist
Matthew B. A. McDermott
Assistant Professor, Columbia University Department of Biomedical Informatics
Machine Learning, Biomedical Informatics