🤖 AI Summary
To address the weak semantic expressivity and limited biological interpretability of existing speech representations, this paper proposes AuriStream, a two-stage auditory-inspired speech representation framework. The first stage maps raw audio to a time-frequency representation grounded in cochlear physiology, from which it derives *cochlear tokens*: biologically interpretable, discrete speech units. The second stage trains an autoregressive sequence model over these tokens to learn phonemic and lexical structure. AuriStream supports both representation learning and generation, enabling speech continuation, audible waveform synthesis, and spectrogram visualization. It achieves state-of-the-art lexical semantics and competitive performance on the SUPERB benchmark, demonstrating that cochlea-motivated discretization strengthens the semantic capacity of speech representations.
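To make the two-stage design concrete, below is a minimal, hypothetical sketch of such a pipeline in PyTorch. It is not the authors' implementation: the cochlear front-end is approximated by a strided convolutional filterbank, quantization by nearest-codebook lookup, and the second stage by a small causal Transformer; the module names (`CochlearTokenizer`, `AutoregressiveLM`) and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of the two-stage pipeline (not the authors' code).
# Stage 1 approximates the cochlear front-end with a strided conv filterbank
# and quantizes frames against a learned codebook; Stage 2 is a small causal
# Transformer doing next-token prediction over the resulting cochlear tokens.
import torch
import torch.nn as nn


class CochlearTokenizer(nn.Module):
    """Stage 1: waveform -> time-frequency frames -> discrete tokens."""

    def __init__(self, n_filters=64, codebook_size=1024):
        super().__init__()
        # Stand-in for a cochlear filterbank: strided 1-D conv over audio.
        self.filterbank = nn.Conv1d(1, n_filters, kernel_size=400, stride=160)
        self.codebook = nn.Embedding(codebook_size, n_filters)

    def forward(self, wav):                                  # wav: (B, samples)
        frames = self.filterbank(wav.unsqueeze(1)).transpose(1, 2)  # (B, T, F)
        # Nearest-codebook-entry assignment (k-means-style quantization).
        dists = torch.cdist(frames, self.codebook.weight)           # (B, T, K)
        return dists.argmin(dim=-1)                                 # (B, T)


class AutoregressiveLM(nn.Module):
    """Stage 2: causal next-token prediction over cochlear tokens."""

    def __init__(self, vocab=1024, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                               # tokens: (B, T)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.encoder(self.embed(tokens), mask=mask)
        return self.head(hidden)                             # (B, T, vocab)


wav = torch.randn(2, 16000)            # two clips of fake 16 kHz audio, 1 s
tokens = CochlearTokenizer()(wav)      # (2, 98) discrete cochlear tokens
logits = AutoregressiveLM()(tokens)    # (2, 98, 1024) next-token logits
```

In a real system the codebook would be learned (e.g., by k-means over cochleagram frames) rather than randomly initialized, and the front-end would implement an actual cochlear model rather than a plain convolution.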
📝 Abstract
We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete **cochlear tokens**. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, as well as state-of-the-art lexical semantics, and shows competitive performance on diverse downstream SUPERB speech tasks. Complementing its strong representational capabilities, AuriStream generates continuations of audio that can be visualized in spectrogram space and decoded back into audio, providing insights into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
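As an illustration of the continuation capability the abstract describes, the following sketch autoregressively extends a cochlear-token prompt using the `AutoregressiveLM` stand-in from the sketch above; temperature sampling and the decoding note are assumptions, not necessarily the paper's method.

```python
# Illustrative continuation loop over cochlear tokens, reusing the
# AutoregressiveLM sketch above; temperature sampling is an assumption,
# not necessarily the paper's decoding strategy.
import torch


@torch.no_grad()
def continue_tokens(lm, prompt, n_new=50, temperature=1.0):
    """Autoregressively extend a (B, T) token prompt by n_new steps."""
    tokens = prompt
    for _ in range(n_new):
        logits = lm(tokens)[:, -1] / temperature       # last-step logits
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)  # append sampled token
    return tokens


# Usage: extended = continue_tokens(AutoregressiveLM(), tokens)
# The extended tokens could then be mapped back to spectrogram frames via the
# codebook embeddings and vocoded to a waveform for listening.
```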