🤖 AI Summary
Existing speech-based brain–computer interfaces (BCIs) predominantly employ cascaded decoding (phoneme → text), which hinders end-to-end joint optimization and fails to unify the neural representations of attempted and imagined speech. Method: We propose the first end-to-end Brain-to-Text framework, featuring a cross-task, cross-species pretrained neural encoder; integrated cross-modal contrastive learning; alignment with audio large language models; and cascaded n-gram language model–based refinement. Contribution/Results: Our method achieves the first direct mapping from neural signals to coherent, grammatically plausible text. On the Brain-to-Text '24/'25 benchmarks, it establishes new state-of-the-art performance, reducing word error rate from 24.69% to 10.22%. Crucially, it aligns and generalizes across attempted and imagined speech representations, enabling a novel paradigm for language restoration in aphasic patients.
📝 Abstract
Speech brain–computer interfaces (BCIs) aim to restore communication for people with paralysis by translating neural activity into text. Most systems use cascaded frameworks that decode phonemes before assembling sentences with an n-gram language model (LM), preventing joint optimization across stages. Here, we introduce an end-to-end Brain-to-Text (BIT) framework that translates neural activity into coherent sentences using a single differentiable neural network. Central to our approach is a cross-task, cross-species pretrained neural encoder whose representations transfer to both attempted and imagined speech. In a cascaded setting with an n-gram LM, the pretrained encoder establishes a new state of the art (SOTA) on the Brain-to-Text '24 and '25 benchmarks. Integrated end-to-end with audio large language models (LLMs) and trained with contrastive learning for cross-modal alignment, BIT reduces the word error rate (WER) of the prior end-to-end method from 24.69% to 10.22%. Notably, we find that small-scale audio LLMs markedly improve end-to-end decoding. Beyond record-setting performance, BIT aligns attempted and imagined speech embeddings to enable cross-task generalization. Altogether, our approach advances the integration of large, diverse neural datasets, paving the way for an end-to-end decoding framework that supports seamless, differentiable optimization.
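The cross-modal contrastive objective mentioned above is not specified in detail here; a common choice for aligning paired embeddings from two modalities (e.g., neural and audio/text) is a symmetric InfoNCE loss, as in CLIP-style training. The sketch below is an illustrative NumPy implementation under that assumption; the function name, batch shapes, and temperature value are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def info_nce_loss(neural_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for cross-modal alignment (illustrative sketch).

    Rows are paired: neural_emb[i] should match text_emb[i]; all other rows
    in the batch act as negatives. Both inputs have shape (B, D).
    """
    # L2-normalize so the dot product is cosine similarity
    n = neural_emb / np.linalg.norm(neural_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (n @ t.T) / temperature        # (B, B); diagonal = positive pairs
    idx = np.arange(logits.shape[0])

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()       # pick out the matched pairs

    # average the neural→text and text→neural directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each neural embedding toward its paired text embedding while pushing it away from the other items in the batch, which is one plausible way to realize the attempted/imagined speech alignment the abstract describes.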