CODE-II: A large-scale dataset for artificial intelligence in ECG analysis

📅 2025-11-19
🤖 AI Summary
AI-driven electrocardiogram (ECG) analysis is limited by annotation quality, dataset size, and narrow clinical scope. CODE-II is a large-scale, real-world 12-lead ECG dataset of about 2.7 million cardiologist-reviewed recordings covering 66 clinically meaningful diagnostic classes. Annotations follow standardized diagnostic criteria, and two subsets accompany the dataset: CODE-II-open (publicly available) and CODE-II-test (reviewed by multiple cardiologists for blinded evaluation). A neural network pre-trained on CODE-II achieved superior transfer performance on the external benchmarks PTB-XL and CPSC 2018, outperforming alternatives trained on larger datasets.

📝 Abstract
Data-driven methods for electrocardiogram (ECG) interpretation are rapidly progressing. Large datasets have enabled advances in artificial intelligence (AI)-based ECG analysis, yet limitations in annotation quality, size, and scope remain major challenges. Here we present CODE-II, a large-scale real-world dataset of 2,735,269 12-lead ECGs from 2,093,807 adult patients collected by the Telehealth Network of Minas Gerais (TNMG), Brazil. Each exam was annotated using standardized diagnostic criteria and reviewed by cardiologists. A defining feature of CODE-II is a set of 66 clinically meaningful diagnostic classes, developed with cardiologist input and routinely used in telehealth practice. We additionally provide two subsets: CODE-II-open, a publicly available subset of 15,000 patients, and CODE-II-test, a non-overlapping set of 8,475 exams reviewed by multiple cardiologists for blinded evaluation. A neural network pre-trained on CODE-II achieved superior transfer performance on external benchmarks (PTB-XL and CPSC 2018) and outperformed alternatives trained on larger datasets.
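The counts in the abstract imply that some patients contribute more than one exam, which matters when checking that an evaluation set is truly non-overlapping. The sketch below computes the average exams per patient from the reported totals and shows one way to check that two splits share no patients. The record layout (`exam_id`, `patient_id`), the helper names, and the toy records are assumptions made for illustration, not the CODE-II release format. The abstract also does not say whether "non-overlapping" is defined per exam or per patient, so the patient-level check is an assumption too.

```python
# Hypothetical sketch: field names and helpers are illustrative assumptions,
# not part of the CODE-II release.

def exams_per_patient(n_exams: int, n_patients: int) -> float:
    """Average number of exams per patient."""
    return n_exams / n_patients


def shared_patients(split_a, split_b):
    """Return the patient IDs present in both splits (empty set means disjoint)."""
    ids_a = {rec["patient_id"] for rec in split_a}
    ids_b = {rec["patient_id"] for rec in split_b}
    return ids_a & ids_b


# Totals reported in the abstract.
ratio = exams_per_patient(2_735_269, 2_093_807)
print(f"{ratio:.2f} exams per patient on average")

# Toy records, invented for illustration only.
train = [
    {"exam_id": 1, "patient_id": "a"},
    {"exam_id": 2, "patient_id": "a"},  # same patient, second exam
    {"exam_id": 3, "patient_id": "b"},
]
test = [{"exam_id": 4, "patient_id": "c"}]

overlap = shared_patients(train, test)
print("patients in both splits:", overlap)
```

Checking at the patient level is the stricter choice here: two exams from the same person can be highly similar, so an exam-level check alone could let near-duplicates leak into evaluation.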
Problem

Research questions and friction points this paper is trying to address.

Developing a large-scale ECG dataset with standardized, cardiologist-reviewed annotations
Addressing limitations in the annotation quality, size, and scope of datasets for AI-based ECG analysis
Providing clinically meaningful diagnostic classes for improved ECG interpretation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale dataset of 2,735,269 12-lead ECGs from 2,093,807 adult patients
Annotations based on standardized diagnostic criteria and reviewed by cardiologists
Neural network pre-training for superior transfer performance
Authors
P. E. O. G. B. Abreu
Universidade Federal de Minas Gerais (UFMG), Brazil
Gabriela M. M. Paixão
Universidade Federal de Minas Gerais (UFMG), Brazil
Jiawei Li
Uppsala University, Sweden
Paulo R. Gomes
Universidade Federal de Minas Gerais (UFMG), Brazil
Peter W. Macfarlane
University of Glasgow, Scotland
Ana C. S. Oliveira
Universidade Federal de Minas Gerais (UFMG), Brazil
Vinicius T. Carvalho
Universidade Federal de Minas Gerais (UFMG), Brazil
Thomas B. Schön
Uppsala University, Sweden
Antonio Luiz P. Ribeiro
Universidade Federal de Minas Gerais (UFMG), Brazil
Cardiology, Chagas disease, Electrocardiography, Cardiovascular epidemiology, Telemedicine
Antonio H. Ribeiro
Uppsala University, Sweden