🤖 AI Summary
Current non-invasive speech decoding is hindered by limited single-subject MEG data volume and coarse annotations. To address this, we introduce the largest single-subject MEG speech decoding dataset to date—comprising over 50 hours of high-fidelity natural speech (the complete *Sherlock Holmes* corpus) with fine-grained phoneme- and word-level annotations, enabling three core tasks: speech detection, phoneme classification, and word classification. This dataset exceeds prior single-subject efforts by 5× and mainstream benchmarks by 50×, achieving, for the first time, deep-learning–ready scale for non-invasive single-subject decoding. We accompany it with a high-precision acquisition paradigm, an open-source Python toolkit, standardized data interfaces, and reproducible train/val/test splits. Baseline experiments demonstrate substantial performance gains across all three decoding tasks with increased data volume. Furthermore, we provide a unified evaluation framework to advance neural representation modeling and accelerate BCI clinical translation.
📝 Abstract
LibriBrain represents the largest single-subject MEG dataset to date for speech decoding, with over 50 hours of recordings (5× larger than the next comparable dataset and 50× larger than most). This unprecedented "depth" of within-subject data enables exploration of neural representations at a scale previously unavailable with non-invasive methods. LibriBrain comprises high-quality MEG recordings together with detailed annotations from a single participant listening to naturalistic spoken English, covering nearly the full Sherlock Holmes canon. Designed to support advances in neural decoding, LibriBrain comes with a Python library for streamlined integration with deep learning frameworks, standard data splits for reproducibility, and baseline results for three foundational decoding tasks: speech detection, phoneme classification, and word classification. Baseline experiments demonstrate that increasing training data yields substantial improvements in decoding performance, highlighting the value of scaling up deep, within-subject datasets. By releasing this dataset, we aim to empower the research community to advance speech decoding methodologies and accelerate the development of safe, effective clinical brain-computer interfaces.
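The "standard data splits for reproducibility" mentioned above can be illustrated with a minimal sketch. Note this is a hypothetical helper (`make_splits` and its parameters are illustrative assumptions, not the dataset's actual Python API): the key idea is that a fixed random seed yields the same train/val/test partition on every run, so results remain comparable across papers.

```python
import random

def make_splits(n_examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Deterministically partition example indices into train/val/test.

    Hypothetical helper for illustration only -- not the LibriBrain API.
    A fixed seed makes the shuffle reproducible, so every user of the
    dataset evaluates on exactly the same held-out examples.
    """
    assert abs(sum(ratios) - 1.0) < 1e-9, "split ratios must sum to 1"
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seeded RNG => same order every run
    n_train = int(ratios[0] * n_examples)
    n_val = int(ratios[1] * n_examples)
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }

splits = make_splits(1000, seed=42)
print(len(splits["train"]), len(splits["val"]), len(splits["test"]))  # 800 100 100
```

In practice, the released splits would be shipped as fixed index files rather than recomputed, but the principle is the same: disjoint, deterministic partitions that make decoding results directly comparable across studies.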