Representing Speech Through Autoregressive Prediction of Cochlear Tokens

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak semantic expressivity and limited biological interpretability of existing speech representations, this paper proposes AuriStream, a two-stage auditory-inspired speech representation framework. The first stage maps raw audio to a time-frequency representation grounded in cochlear physiology, from which it derives *cochlear tokens*: discrete, biologically interpretable speech units. The second stage trains an autoregressive sequence model over these tokens to learn phonemic and lexical structure. AuriStream supports both representation learning and generation, enabling speech continuation, audible waveform synthesis, and spectrogram visualization. On the SUPERB benchmark it performs competitively across diverse downstream tasks, and it achieves state-of-the-art results on lexical semantics, suggesting that cochlea-motivated discretization strengthens the semantic capacity of speech representations.

📝 Abstract
We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete *cochlear tokens*. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, and state-of-the-art lexical semantics. AuriStream shows competitive performance on diverse downstream SUPERB speech tasks. Complementing AuriStream's strong representational capabilities, it generates continuations of audio which can be visualized in a spectrogram space and decoded back into audio, providing insights into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
Problem

Research questions and friction points this paper is trying to address.

Encode speech using cochlear tokens and autoregressive modeling
Learn phoneme and word representations for speech tasks
Generate continuations of audio for model prediction insights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage cochlear token encoding
Autoregressive sequence modeling
Biologically inspired auditory processing
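The two-stage pipeline above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the paper's actual cochleagram, quantizer, and sequence model are not specified in this summary, so stage 1 is stood in for by log-magnitude spectra with a one-shot codebook, and stage 2 by a bigram next-token model in place of the autoregressive sequence model.

```python
import numpy as np

def cochlear_tokens(audio, frame=160, n_filters=16, n_tokens=32, seed=0):
    """Stage 1 (toy): frame the waveform, take log-magnitude spectra as a
    crude stand-in for a cochlear time-frequency map, then quantize each
    frame to its nearest codebook entry, yielding discrete 'cochlear tokens'."""
    n = len(audio) // frame
    frames = audio[: n * frame].reshape(n, frame)
    spec = np.log1p(np.abs(np.fft.rfft(frames))[:, :n_filters])
    rng = np.random.default_rng(seed)
    # One-shot codebook: sample frames as centroids (a real system would
    # learn the codebook, e.g. via k-means or a VQ objective).
    centroids = spec[rng.choice(n, size=n_tokens, replace=False)]
    dists = ((spec[:, None, :] - centroids[None]) ** 2).sum(-1)
    return dists.argmin(1)  # one token id per frame

def fit_autoregressive(tokens, n_tokens=32, alpha=0.1):
    """Stage 2 (toy): smoothed bigram counts standing in for the
    autoregressive sequence model over cochlear tokens."""
    counts = np.full((n_tokens, n_tokens), alpha)
    for a, b in zip(tokens[:-1], tokens[1:]):
        counts[a, b] += 1
    return counts / counts.sum(1, keepdims=True)

def continue_tokens(model, start, steps, seed=0):
    """Sampled 'speech continuation' in token space; a real system would
    decode the tokens back to a spectrogram and then to audio."""
    rng = np.random.default_rng(seed)
    out = [start]
    for _ in range(steps):
        out.append(int(rng.choice(len(model), p=model[out[-1]])))
    return out
```

The two stages are deliberately decoupled: the tokenizer fixes a discrete vocabulary first, and the sequence model only ever sees token ids, which is what makes continuation, resynthesis, and representation probing all operate in the same token space.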
Greta Tuckute
Post-doc, Brain and Cognitive Sciences, MIT
Cognitive neuroscience, artificial intelligence
Klemen Kotar
PhD Candidate, Stanford University
Artificial Intelligence
Evelina Fedorenko
Department of Brain and Cognitive Sciences & McGovern Institute for Brain Research, MIT, USA
Daniel L. K. Yamins
Department of Computer Science & Wu Tsai Neurosciences Institute, Stanford University, USA