🤖 AI Summary
Existing ECG foundation models typically utilize only the final-layer representations from Vision Transformers (ViTs), leading to suboptimal exploitation of hierarchical feature information. Method: We propose Post-pretraining Mixture-of-layers Aggregation (PMA), a novel architecture that employs a learnable gating network to dynamically weight and fuse hidden representations across all ViT layers; additionally, we introduce a grouped mean aggregation strategy during post-pretraining to enhance inter-layer diversity modeling. Contribution/Results: PMA is the first method to systematically uncover and synergistically leverage the complementary nature of multi-layer representations in ECG Transformers, thereby transcending the conventional single-layer representation paradigm. Extensive experiments across diverse downstream tasks—including arrhythmia classification and abnormality detection—demonstrate that PMA consistently outperforms strong baselines, validating the effectiveness, robustness, and generalizability of adaptive multi-layer representation fusion for ECG analysis.
📝 Abstract
Transformer-based foundation models for electrocardiograms (ECGs) have recently achieved impressive performance in many downstream applications. However, the internal representations of such models across layers have not been fully understood or exploited. An important question arises: does the final layer of a pre-trained Transformer, the *de facto* representational layer, provide optimal performance for downstream tasks? Our empirical and theoretical analyses answer this question in the negative, motivating a novel approach that effectively leverages the representation diversity across the model's layers. Specifically, we introduce a novel architecture called Post-pretraining Mixture-of-layers Aggregation (PMA), which enables a flexible combination of the layer-wise representations from the layer stack of a Transformer-based foundation model. We first pre-train the model on ECG signals using a 1-dimensional Vision Transformer (ViT) via masked modeling. In downstream applications, instead of relying solely on the model's last layer, we employ a gating network to selectively fuse representations from the pretrained model's layers, thereby enhancing representational power and improving performance on downstream tasks. In addition, we extend the proposed method to the pretraining stage by aggregating all layer representations through group-wise averaging before feeding them into the decoder-based Transformer.
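The two aggregation mechanisms described above can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the authors' implementation: it assumes the gating network reduces to a learnable logit per layer (softmax-normalized into a convex combination of layer outputs), and that group-wise averaging splits the layer stack into contiguous groups. Function names, shapes, and the group-splitting scheme are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_layer_fusion(layer_states, gate_logits):
    """Downstream-stage fusion: weight each layer's hidden states by a
    softmax gate and sum them (hypothetical simplified gating network).

    layer_states: (L, T, D) hidden states from each of L Transformer layers.
    gate_logits:  (L,) raw gate scores (learnable in practice; fixed here).
    Returns a fused (T, D) representation.
    """
    w = softmax(gate_logits)                      # convex combination weights
    return np.tensordot(w, layer_states, axes=1)  # weighted sum over layers

def grouped_mean(layer_states, n_groups):
    """Pretraining-stage variant: average layer states within contiguous
    groups of layers before passing them on (assumed grouping scheme)."""
    groups = np.array_split(np.arange(layer_states.shape[0]), n_groups)
    return np.stack([layer_states[g].mean(axis=0) for g in groups])

# Toy usage: 12 layers, 8 tokens, 16-dim features.
rng = np.random.default_rng(0)
h = rng.standard_normal((12, 8, 16))
fused = gated_layer_fusion(h, np.zeros(12))  # equal logits -> plain mean
grouped = grouped_mean(h, 4)                 # (4, 8, 16) group summaries
```

With equal gate logits the fusion degenerates to a uniform average of all layers; training the logits lets the model shift weight toward whichever layers carry the most task-relevant features, which is the behavior the gating network is meant to capture.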