🤖 AI Summary
This work addresses three limitations of existing physiological signal foundation models: modality entanglement caused by device heterogeneity, poor cross-frequency generation capability, and high computational overhead. The authors propose a two-stage discrete translation paradigm. First, a hierarchical residual vector quantization (RVQ) scheme builds a universal tokenizer that disentangles heterogeneous signals, such as ECG and PPG, into structured discrete latent representations. Second, a physiology-informed, context-prompt-driven latent translator performs cross-modal sequence conversion. By unifying modeling in a discrete latent space, the framework prevents inter-modality interference and improves fidelity in both cross-modal synthesis and cross-frequency super-resolution, while reducing model size to 0.09B parameters for edge deployment. Experiments show that the F1 score for R-peak detection in PPG-to-ECG synthesis improves from 0.37 to 0.83, and that Pearson correlation reaches 0.9956 in 25 Hz to 100 Hz super-resolution, significantly outperforming large-scale baselines.
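The two reported metrics can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names `pearson` and `peak_f1`, the greedy one-to-one matching rule, and the `tolerance` window (in samples) are assumptions made for illustration, since the text does not say how detected R-peaks are matched to reference peaks.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length waveforms."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def peak_f1(ref_peaks, pred_peaks, tolerance):
    """F1 score for peak detection.

    A predicted peak counts as a true positive if it lies within
    `tolerance` samples of a not-yet-matched reference peak
    (hypothetical matching rule, chosen for illustration).
    """
    ref, pred = sorted(ref_peaks), sorted(pred_peaks)
    i = j = true_pos = 0
    while i < len(ref) and j < len(pred):
        if abs(ref[i] - pred[j]) <= tolerance:
            true_pos += 1
            i += 1
            j += 1
        elif pred[j] < ref[i]:
            j += 1  # spurious prediction before the next reference peak
        else:
            i += 1  # reference peak that no prediction matched
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy data: sample indices of R-peaks in a reference and a synthesized ECG.
reference_peaks = [100, 200, 300]
synthesized_peaks = [102, 198, 450]
f1 = peak_f1(reference_peaks, synthesized_peaks, tolerance=5)
r = pearson([0.0, 1.0, 0.0, -1.0], [0.1, 0.9, 0.0, -1.1])
print(f"F1 = {f1:.3f}, Pearson r = {r:.4f}")
```

In this toy case two of the three synthesized peaks fall within the tolerance window, so precision and recall are both 2/3. Temporal phase drift in a synthesized ECG would shift peaks outside the window and lower the F1 score even when the waveform shape looks plausible.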
📝 Abstract
The analysis of physiological time series, such as electrocardiograms (ECG) and photoplethysmograms (PPG), is persistently hindered by modality and frequency gaps stemming from heterogeneous recording devices. Existing foundation models typically rely on continuous latent spaces, which frequently suffer from severe modality entanglement, lack high-fidelity cross-frequency generative capacity, and impose computational costs too high for edge-device deployment. In this paper, we propose Compact Latent Manifold Translation (CLMT), a parameter-efficient (0.09B) unified framework that bridges these gaps through a two-stage discrete translation paradigm. First, we introduce a Universal Tokenizer that uses Hierarchical Residual Vector Quantization (RVQ) to decouple heterogeneous signals into isolated, well-structured discrete latent manifolds, preventing inter-modality interference. Second, a Context-Prompted Latent Translator maps these discrete tokens across modalities by integrating static physiological priors, reframing complex signal synthesis as a pure latent sequence translation task. Extensive evaluations demonstrate that our 0.09B model significantly outperforms much larger baselines. In cross-modal PPG-to-ECG synthesis, it resolves temporal phase drift and improves the clinical R-peak detection F1-score from 0.37 (baseline) to 0.83. Furthermore, in extreme cross-frequency super-resolution (25 Hz to 100 Hz), it recovers high-frequency diagnostic landmarks, achieving a Pearson correlation of 0.9956. By learning a universal discrete language for biological signals with a fraction of the computational footprint, our approach sets a new trajectory for edge-deployable, multi-modal medical foundation models.
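To make the tokenizer idea concrete, the following is a minimal sketch of plain residual vector quantization, in which each stage quantizes the residual left by the previous stage and the resulting indices form the discrete tokens. It is not the CLMT tokenizer: the codebooks here are random rather than learned, the dimensions are arbitrary, the function names are hypothetical, and the all-zero codeword in every codebook is an illustrative choice that guarantees the reconstruction error never increases from one stage to the next.

```python
import random


def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


def sq_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Assumed toy setup: 3 stages, 16 codewords each, 4-dimensional latents.
rng = random.Random(0)
dim, num_stages, codebook_size = 4, 3, 16
codebooks = []
for stage in range(num_stages):
    scale = 0.5 ** stage  # later stages model finer detail
    book = [[0.0] * dim]  # zero codeword: a stage may leave the residual as-is
    book += [[rng.gauss(0, scale) for _ in range(dim)]
             for _ in range(codebook_size - 1)]
    codebooks.append(book)

latent = [rng.gauss(0, 1) for _ in range(dim)]
tokens = rvq_encode(latent, codebooks)

# Reconstruction error when decoding with only the first n stages.
errors = [
    sq_error(latent, rvq_decode(tokens[:n], codebooks[:n]))
    for n in range(1, num_stages + 1)
]
print("tokens:", tokens)
print("error per number of stages:", errors)
```

Each latent vector becomes a short tuple of integer tokens, one per stage, which is the kind of discrete sequence a latent translator could then map from one modality to another. In a trained system the codebooks would be learned jointly with an encoder and decoder rather than sampled at random.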