Neural Codecs as Biosignal Tokenizers

📅 2025-10-10
🤖 AI Summary
Neurophysiological signals such as EEG and EMG are high-dimensional and noisy, so conventional analysis typically relies on extensive pre-processing and handcrafted feature extraction. BioCodec is a representation learning framework inspired by neural codecs that maps continuous biosignal time series to discrete tokens capturing low-level signal characteristics. Its main contributions are: (1) a tokenizer pre-trained on thousands of hours of EEG that transfers to clinical diagnostic tasks, sleep physiology, and speech and motor imagery decoding, and that is also shown to suit EMG signals; (2) a qualitative analysis of codebook usage, with an estimate of the spatial coherence of codebook embeddings derived from EEG connectivity; and (3) performance competitive with state-of-the-art models, particularly in low-resource settings. The source code and model checkpoints are shared.

📝 Abstract
Neurophysiological recordings such as electroencephalography (EEG) offer accessible and minimally invasive means of estimating physiological activity for applications in healthcare, diagnostic screening, and even immersive entertainment. However, these recordings yield high-dimensional, noisy time-series data that typically require extensive pre-processing and handcrafted feature extraction to reveal meaningful information. Recently, there has been a surge of interest in applying representation learning techniques from large pre-trained (foundation) models to effectively decode and interpret biosignals. We discuss the challenges posed for incorporating such methods and introduce BioCodec, an alternative representation learning framework inspired by neural codecs to capture low-level signal characteristics in the form of discrete tokens. Pre-trained on thousands of EEG hours, BioCodec shows efficacy across multiple downstream tasks, ranging from clinical diagnostic tasks and sleep physiology to decoding speech and motor imagery, particularly in low-resource settings. Additionally, we provide a qualitative analysis of codebook usage and estimate the spatial coherence of codebook embeddings from EEG connectivity. Notably, we also document the suitability of our method to other biosignal data, i.e., electromyographic (EMG) signals. Overall, the proposed approach provides a versatile solution for biosignal tokenization that performs competitively with state-of-the-art models. The source code and model checkpoints are shared.
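The abstract mentions a qualitative analysis of codebook usage but does not say which statistics were used. As a minimal sketch, assuming tokens are integer indices into a codebook of known size, one common way to summarize usage in vector-quantized models is a per-code count together with the perplexity of the empirical token distribution. The function name `codebook_usage` and the toy inputs are hypothetical and not taken from the BioCodec code.

```python
import numpy as np

def codebook_usage(tokens, codebook_size):
    """Return per-code counts and the perplexity of the token distribution.

    Perplexity is exp(entropy): it equals codebook_size when every code is
    used equally often and 1.0 when a single code is used for every frame.
    """
    counts = np.bincount(tokens, minlength=codebook_size)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]  # skip unused codes to avoid log(0)
    entropy = -(nonzero * np.log(nonzero)).sum()
    return counts, float(np.exp(entropy))

# Toy example: 8 codes, each used exactly 4 times (uniform usage).
uniform_tokens = np.arange(8).repeat(4)
uniform_counts, uniform_ppl = codebook_usage(uniform_tokens, 8)

# Toy example: every frame collapses onto code 0.
collapsed_tokens = np.zeros(32, dtype=int)
collapsed_counts, collapsed_ppl = codebook_usage(collapsed_tokens, 8)
```

A perplexity far below the codebook size would suggest that only a few codes carry most of the signal, which is one possible starting point for the kind of usage analysis the abstract describes.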

Problem

Research questions and friction points this paper is trying to address.

Processing high-dimensional noisy neurophysiological recordings efficiently
Reducing manual feature extraction for biosignal interpretation
Enabling versatile biosignal analysis across multiple downstream tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

BioCodec framework tokenizes biosignals into discrete tokens
Pre-trained on thousands of EEG hours for multiple tasks
Applies neural codec-inspired representation learning to biosignals
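
The tokenization idea listed above can be illustrated with the nearest-neighbor codebook lookup used in vector-quantized neural codecs. This is a minimal sketch under stated assumptions, not the BioCodec implementation: the source does not describe the encoder, the quantizer variant, or the codebook size, so the function names, array shapes, and random data below are all hypothetical.

```python
import numpy as np

def tokenize(latents, codebook):
    """Map each latent frame to the index of its nearest codebook vector.

    latents:  (T, D) array standing in for encoder outputs (assumed shape)
    codebook: (K, D) array of code vectors
    """
    # Squared Euclidean distance between every frame and every code: (T, K)
    diffs = latents[:, None, :] - codebook[None, :, :]
    dist2 = (diffs ** 2).sum(axis=-1)
    return dist2.argmin(axis=1)

def detokenize(tokens, codebook):
    """Replace each token with its code vector (input to a decoder)."""
    return codebook[tokens]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))  # K=16 codes of dimension D=4

# Synthetic latents: slightly perturbed copies of known codes.
true_ids = rng.integers(0, 16, size=50)
latents = codebook[true_ids] + 0.01 * rng.normal(size=(50, 4))

tokens = tokenize(latents, codebook)    # discrete token sequence, shape (50,)
recon = detokenize(tokens, codebook)    # quantized latents, shape (50, 4)
```

In a trained codec the codebook is learned jointly with the encoder and decoder; here it is random, so the example only demonstrates the lookup step that turns a continuous sequence into discrete tokens.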
Kleanthis Avramidis
University of Southern California
Tiantian Feng
Postdoc Researcher
Health and Behaviors, Wearable Computing, Affective Computing, Speech and Biosignal, Responsible ML
Woojae Jeong
University of Southern California
Jihwan Lee
University of Southern California
Wenhui Cui
PhD student, University of Southern California
Machine Learning, Medical Imaging
Richard M Leahy
University of Southern California
Shrikanth Narayanan
University of Southern California