Position: Evaluation of ECG Representations Must Be Fixed

📅 2026-02-19
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Current evaluation of 12-lead ECG representation learning overemphasizes arrhythmia classification and waveform morphology while neglecting clinically important tasks such as structural heart disease, hemodynamic status, and patient prognosis, creating a gap between research benchmarks and real-world clinical needs. This work proposes a multitask evaluation framework that combines the standard datasets PTB-XL, CPSC2018, and CSN with newly curated clinical labels to systematically benchmark existing methods under class-imbalanced, multi-label settings. The study shows that mainstream pretrained models have notable limitations on these extended clinical tasks, and that randomly initialized encoders can match or even surpass state-of-the-art models under linear probing. Crucially, adopting more rigorous evaluation protocols substantially alters performance rankings, underscoring the need to reform evaluation standards for ECG representation learning.

📝 Abstract
This position paper argues that current benchmarking practice in 12-lead ECG representation learning must be fixed to ensure progress is reliable and aligned with clinically meaningful objectives. The field has largely converged on three public multi-label benchmarks (PTB-XL, CPSC2018, CSN) dominated by arrhythmia and waveform-morphology labels, even though the ECG is known to encode substantially broader clinical information. We argue that downstream evaluation should expand to include an assessment of structural heart disease and patient-level forecasting, in addition to other evolving ECG-related endpoints, as relevant clinical targets. Next, we outline evaluation best practices for multi-label, imbalanced settings, and show that when they are applied, the literature's current conclusion about which representations perform best is altered. Furthermore, we demonstrate the surprising result that a randomly initialized encoder with linear evaluation matches state-of-the-art pre-training on many tasks. This motivates the use of a random encoder as a reasonable baseline model. We substantiate our observations with an empirical evaluation of three representative ECG pre-training approaches across six evaluation settings: the three standard benchmarks, a structural disease dataset, hemodynamic inference, and patient forecasting.
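The random-encoder baseline described in the abstract can be illustrated with a toy sketch: a fixed, never-trained random projection acts as the "encoder," a linear classifier is fit on its frozen features (linear probing), and performance is reported as macro AUROC over an imbalanced multi-label target. All data, dimensions, and function names below are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): a frozen, randomly initialized
# "encoder" probed with a linear classifier and scored with macro AUROC on a
# synthetic, imbalanced multi-label problem.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for 12-lead ECGs: 1000 records, 12 leads x 500 samples.
n, leads, length, n_labels = 1000, 12, 500, 5
X = rng.normal(size=(n, leads * length))

# Imbalanced multi-label targets driven by a hidden linear signal
# (~10% positives per label).
W_true = rng.normal(size=(leads * length, n_labels))
logits = X @ W_true / np.sqrt(leads * length)
Y = (logits > np.quantile(logits, 0.9, axis=0)).astype(int)

# "Random encoder": a fixed random projection plus ReLU, never trained.
d = 128
W_enc = rng.normal(size=(leads * length, d)) / np.sqrt(leads * length)
Z = np.maximum(X @ W_enc, 0.0)

# Linear probe on the frozen features.
train, test = slice(0, 800), slice(800, n)
probe = OneVsRestClassifier(LogisticRegression(max_iter=1000))
probe.fit(Z[train], Y[train])
scores = probe.predict_proba(Z[test])

print("macro AUROC:", roc_auc_score(Y[test], scores, average="macro"))
```

Macro-averaged AUROC is one of the metrics commonly recommended for imbalanced multi-label evaluation because it weights every label equally regardless of prevalence; the paper's point is that under such protocols, baselines like this random encoder can look surprisingly competitive.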
Problem

Research questions and friction points this paper is trying to address.

ECG representation learning
benchmarking
clinical evaluation
structural heart disease
multi-label classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

ECG representation learning
evaluation benchmark
clinical relevance
random encoder baseline
multi-label imbalance
Zachary Berger
Massachusetts Institute of Technology, Cambridge, MA, USA; Massachusetts General Hospital, Boston, MA, USA
Daniel Prakah-Asante
Massachusetts Institute of Technology, Cambridge, MA, USA; Massachusetts General Hospital, Boston, MA, USA
John Guttag
Unknown affiliation
Collin M. Stultz
Massachusetts Institute of Technology, Cambridge, MA, USA; Massachusetts General Hospital, Boston, MA, USA