EEG-DLite: Dataset Distillation for Efficient Large EEG Model Training

📅 2025-12-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of large-scale, noisy data and high computational cost in pretraining foundational EEG models, this paper proposes EEG-DLite, a dataset distillation framework for EEG. It leverages a self-supervised autoencoder to learn robust latent representations, then combines latent-space anomaly detection with redundancy-aware subset optimization to remove low-quality samples while preserving the temporal and structural diversity of the data. This is the first systematic study of data distillation designed specifically for EEG foundation model pretraining, unifying noise-robust sample selection with information-density optimization. Empirical results show that training on only 5% (125 hours) of the original 2,500-hour dataset yields downstream performance comparable to, or even exceeding, full-dataset training, substantially reducing computational cost without sacrificing model efficacy.
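The two-stage selection described above can be sketched in plain NumPy. This is an illustrative reimplementation, not the authors' code: the distance-to-centroid anomaly score and farthest-point sampling are stand-in choices for the paper's latent-space anomaly detection and redundancy-aware subset optimization.

```python
import numpy as np

def distill(latents, keep_frac=0.05, outlier_frac=0.1, seed=0):
    """Toy two-stage selection in latent space (illustrative, not the
    paper's method): (1) drop outliers by distance to the centroid,
    (2) greedily pick a diverse subset via farthest-point sampling
    to reduce redundancy. Returns indices into `latents`."""
    rng = np.random.default_rng(seed)
    n = len(latents)
    # Stage 1: anomaly filtering -- distance to centroid as a crude score.
    dist = np.linalg.norm(latents - latents.mean(axis=0), axis=1)
    keep = np.argsort(dist)[: int(n * (1 - outlier_frac))]
    z = latents[keep]
    # Stage 2: farthest-point sampling for a low-redundancy subset.
    k = max(1, int(n * keep_frac))
    chosen = [int(rng.integers(len(z)))]
    min_d = np.linalg.norm(z - z[chosen[0]], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(min_d))        # point farthest from the subset
        chosen.append(nxt)
        min_d = np.minimum(min_d, np.linalg.norm(z - z[nxt], axis=1))
    return keep[np.array(chosen)]

# Usage: 1,000 fake 16-d latent vectors -> indices of a 5% subset.
z = np.random.default_rng(1).normal(size=(1000, 16))
idx = distill(z, keep_frac=0.05)
```

Farthest-point sampling is one simple way to trade redundancy for coverage; any diversity-aware subset selection (e.g. k-center or determinantal methods) could play the same role.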

📝 Abstract
Large-scale EEG foundation models have shown strong generalization across a range of downstream tasks, but their training remains resource-intensive due to the volume and variable quality of EEG data. In this work, we introduce EEG-DLite, a data distillation framework that enables more efficient pre-training by selectively removing noisy and redundant samples from large EEG datasets. EEG-DLite begins by encoding EEG segments into compact latent representations using a self-supervised autoencoder, allowing sample selection to be performed efficiently and with reduced sensitivity to noise. Based on these representations, EEG-DLite filters out outliers and minimizes redundancy, resulting in a smaller yet informative subset that retains the diversity essential for effective foundation model training. Through extensive experiments, we demonstrate that training on only 5 percent of a 2,500-hour dataset curated with EEG-DLite yields performance comparable to, and in some cases better than, training on the full dataset across multiple downstream tasks. To our knowledge, this is the first systematic study of pre-training data distillation in the context of EEG foundation models. EEG-DLite provides a scalable and practical path toward more effective and efficient physiological foundation modeling. The code is available at https://github.com/t170815518/EEG-DLite.
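The abstract's first step, encoding EEG segments into compact latent representations with a self-supervised autoencoder, can be illustrated with a deliberately minimal tied-weight linear autoencoder trained by reconstruction. This is a hypothetical stand-in for the paper's encoder, which is nonlinear and far larger; only the self-supervised objective (reconstruct the input, no labels) is the point here.

```python
import numpy as np

def train_encoder(x, latent_dim=4, lr=1e-2, epochs=300, seed=0):
    """Tied-weight linear autoencoder trained by gradient descent on
    reconstruction error (self-supervised): encode z = x @ W, decode
    x_hat = z @ W.T, minimize ||x_hat - x||^2. A minimal sketch, not
    the paper's architecture."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    W = rng.normal(scale=0.1, size=(d, latent_dim))
    for _ in range(epochs):
        err = x @ W @ W.T - x                          # reconstruction error
        W -= lr * (x.T @ err @ W + err.T @ x @ W) / n  # gradient of the loss
    return W

# Usage: 200 fake 16-sample EEG "segments" -> 4-d latents per segment,
# which downstream selection (filtering, de-duplication) would operate on.
x = np.random.default_rng(1).normal(size=(200, 16))
W = train_encoder(x)
latents = x @ W
```

Selecting samples in this compact latent space, rather than on raw multi-channel signals, is what makes the subsequent filtering cheap and less sensitive to noise.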
Problem

Research questions and friction points this paper is trying to address.

Distills large EEG datasets to reduce training resource demands
Selectively removes noisy and redundant EEG samples for efficiency
Enables effective model training with a small, informative data subset
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distills large EEG datasets by removing noisy and redundant samples
Uses self-supervised autoencoder for compact latent representations and outlier filtering
Trains foundation models on a small informative subset with comparable performance