The CRITICAL Records Integrated Standardization Pipeline (CRISP): End-to-End Processing of Large-scale Multi-institutional OMOP CDM Data

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multi-center critical care data in OMOP CDM format exhibit institutional heterogeneity, inconsistent terminology, and massive scale, all of which hinder interoperable clinical AI research. Method: We propose a modular, parallel processing framework that integrates SNOMED-CT-based terminology standardization, cross-source deduplication, and unit-of-measure harmonization, coupled with end-to-end audit tracing and benchmark model evaluation. The framework introduces a cross-terminology mapping mechanism and a transparent data quality governance pipeline. Contribution/Results: It completes full-scale, end-to-end processing within 24 hours on standard hardware, producing machine-learning-ready datasets. The open-source pipeline and baseline models reduce preprocessing effort from months to under one day, substantially lowering barriers to entry and enabling reproducible, generalizable clinical AI research grounded in standardized, auditable, high-quality multi-center critical care data.
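
For orientation, the sketch below shows how cross-vocabulary mapping to SNOMED-CT and cross-source deduplication can be expressed against the standard OMOP vocabulary tables (CONCEPT and CONCEPT_RELATIONSHIP with its "Maps to" links). The file paths, column choices, and deduplication keys are illustrative assumptions, not CRISP's actual implementation.

```python
# Minimal sketch (not CRISP's actual code): map heterogeneous source codes to
# SNOMED-CT standard concepts via the OMOP vocabulary's "Maps to" relationships,
# then drop exact duplicates that appear in more than one source table.
import pandas as pd

def build_snomed_map(concept_path: str, relationship_path: str) -> pd.DataFrame:
    """Lookup table from any source concept_id to a SNOMED-CT standard concept_id."""
    concept = pd.read_csv(concept_path, sep="\t", dtype=str)       # OMOP CONCEPT.csv
    rel = pd.read_csv(relationship_path, sep="\t", dtype=str)      # CONCEPT_RELATIONSHIP.csv
    maps_to = rel[rel["relationship_id"] == "Maps to"]
    snomed = concept[(concept["vocabulary_id"] == "SNOMED") &
                     (concept["standard_concept"] == "S")]
    return (maps_to[maps_to["concept_id_2"].isin(snomed["concept_id"])]
            .rename(columns={"concept_id_1": "source_concept_id",
                             "concept_id_2": "snomed_concept_id"})
            [["source_concept_id", "snomed_concept_id"]])

def standardize_measurements(meas: pd.DataFrame, snomed_map: pd.DataFrame) -> pd.DataFrame:
    """Attach SNOMED-CT concepts and deduplicate; the dedup keys are an assumption."""
    meas = meas.copy()
    meas["measurement_source_concept_id"] = meas["measurement_source_concept_id"].astype(str)
    out = meas.merge(snomed_map, how="left",
                     left_on="measurement_source_concept_id",
                     right_on="source_concept_id")
    return out.drop_duplicates(subset=["person_id", "snomed_concept_id",
                                       "measurement_datetime", "value_as_number"])
```
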

📝 Abstract
While existing critical care EHR datasets such as MIMIC and eICU have enabled significant advances in clinical AI research, the CRITICAL dataset opens new frontiers by providing extensive scale and diversity: it contains 1.95 billion records from 371,365 patients across four geographically diverse CTSA institutions. CRITICAL's unique strength lies in capturing full-spectrum patient journeys, including pre-ICU, ICU, and post-ICU encounters across both inpatient and outpatient settings. This multi-institutional, longitudinal perspective creates transformative opportunities for developing generalizable predictive models and advancing health equity research. However, the richness of this multi-site resource introduces substantial complexity in data harmonization, with heterogeneous collection practices and diverse vocabulary usage patterns requiring sophisticated preprocessing approaches. We present CRISP to unlock the full potential of this valuable resource. CRISP systematically transforms raw Observational Medical Outcomes Partnership (OMOP) Common Data Model data into ML-ready datasets through: (1) transparent data quality management with comprehensive audit trails, (2) cross-vocabulary mapping of heterogeneous medical terminologies to unified SNOMED-CT standards, with deduplication and unit standardization, (3) modular architecture with parallel optimization enabling complete dataset processing in <1 day even on standard computing hardware, and (4) comprehensive baseline model benchmarks spanning multiple clinical prediction tasks to establish reproducible performance standards. By providing the processing pipeline, baseline implementations, and detailed transformation documentation, CRISP saves researchers months of preprocessing effort and democratizes access to large-scale multi-institutional critical care data, enabling them to focus on advancing clinical AI.
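
The unit-standardization step named in the abstract can be pictured with a small, hedged example. The concepts, source/target units, and conversion factors below are textbook illustrations chosen for clarity, not CRISP's actual harmonization rules.

```python
# Illustrative sketch of unit-of-measure harmonization; the conversion table and
# canonical units are hypothetical examples, not CRISP's shipped configuration.
import pandas as pd

# (concept_name, source_unit) -> (canonical_unit, multiplicative_factor)
UNIT_CONVERSIONS = {
    ("Glucose", "mmol/L"): ("mg/dL", 18.016),        # molar mass of glucose ~180.16 g/mol
    ("Creatinine", "umol/L"): ("mg/dL", 1 / 88.42),  # standard creatinine conversion
}

def harmonize_units(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale values so each concept is reported in a single canonical unit."""
    df = df.copy()
    for (name, unit), (canonical, factor) in UNIT_CONVERSIONS.items():
        mask = (df["concept_name"] == name) & (df["unit"] == unit)
        df.loc[mask, "value_as_number"] = df.loc[mask, "value_as_number"] * factor
        df.loc[mask, "unit"] = canonical
    return df

# Example: a glucose of 5.5 mmol/L becomes ~99.1 mg/dL after harmonization.
```
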
Problem

Research questions and friction points this paper is trying to address.

Harmonizing multi-institutional EHR data with heterogeneous collection practices
Mapping diverse medical terminologies to unified SNOMED-CT standards
Enabling efficient ML-ready dataset processing for clinical prediction tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end processing pipeline for OMOP CDM data
Cross-vocabulary mapping to unified SNOMED-CT standards
Modular architecture with parallel optimization capabilities (a minimal sketch follows this list)
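
The sketch below illustrates one way such a modular, table-parallel layout could look: each OMOP table is handled by an independent module and processed in a separate worker. The module list, function names, and parquet I/O are assumptions made for illustration and are not taken from the CRISP codebase.

```python
# Minimal sketch of a modular, table-parallel pipeline layout (hypothetical names,
# not the CRISP implementation).
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import pandas as pd

OMOP_TABLES = ["person", "visit_occurrence", "condition_occurrence",
               "drug_exposure", "measurement", "observation"]

def process_table(table: str, src_dir: Path, out_dir: Path) -> str:
    """One self-contained module per OMOP table: read, standardize, write, log."""
    df = pd.read_parquet(src_dir / f"{table}.parquet")
    # ... terminology mapping, deduplication, and unit harmonization would go here ...
    df.to_parquet(out_dir / f"{table}.parquet")
    return f"{table}: {len(df):,} rows processed"

def run_pipeline(src_dir: Path, out_dir: Path, workers: int = 8) -> None:
    """Process independent OMOP tables in parallel and print a simple audit trail."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for line in pool.map(process_table, OMOP_TABLES,
                             [src_dir] * len(OMOP_TABLES),
                             [out_dir] * len(OMOP_TABLES)):
            print(line)  # one audit line per table

if __name__ == "__main__":
    run_pipeline(Path("omop_raw"), Path("omop_ready"))
```

Because OMOP tables can be cleaned largely independently, per-table parallelism is a natural way to keep full-dataset processing under a day on standard hardware, as the paper reports.
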
Xiaolong Luo
School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA
Michael Lingzhi Li
Assistant Professor, Harvard Business School
Integer Optimization · Causal Inference · Precision Medicine · Machine Learning · AI for Healthcare