🤖 AI Summary
To address the weak semantic expressivity and limited biological interpretability of existing speech representations, this paper proposes AuriStream, a two-stage auditory-inspired speech representation framework. The first stage maps raw audio to a time-frequency representation grounded in cochlear physiology, from which it derives *cochlear tokens*: biologically interpretable, discrete speech units. The second stage trains an autoregressive sequence model over these tokens to learn phonemic and lexical structure. AuriStream supports both representation learning and generation, enabling speech continuation, audible waveform synthesis, and spectrogram visualization. It achieves state-of-the-art lexical semantics and competitive performance on the SUPERB benchmark, demonstrating that cochlea-motivated discretization strengthens the semantic capacity of speech representations.
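To make the two-stage design concrete, below is a minimal, hypothetical sketch of such a pipeline in PyTorch. It is not the authors' implementation: the cochlear front-end is approximated by a strided convolutional filterbank, quantization by nearest-codebook lookup, and the second stage by a small causal Transformer; the module names (`CochlearTokenizer`, `AutoregressiveLM`) and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of the two-stage pipeline (not the authors' code).
# Stage 1 approximates the cochlear front-end with a strided conv filterbank
# and quantizes frames against a learned codebook; Stage 2 is a small causal
# Transformer doing next-token prediction over the resulting cochlear tokens.
import torch
import torch.nn as nn


class CochlearTokenizer(nn.Module):
    """Stage 1: waveform -> time-frequency frames -> discrete tokens."""

    def __init__(self, n_filters=64, codebook_size=1024):
        super().__init__()
        # Stand-in for a cochlear filterbank: strided 1-D conv over audio.
        self.filterbank = nn.Conv1d(1, n_filters, kernel_size=400, stride=160)
        self.codebook = nn.Embedding(codebook_size, n_filters)

    def forward(self, wav):                                  # wav: (B, samples)
        frames = self.filterbank(wav.unsqueeze(1)).transpose(1, 2)  # (B, T, F)
        # Nearest-codebook-entry assignment (k-means-style quantization).
        dists = torch.cdist(frames, self.codebook.weight)           # (B, T, K)
        return dists.argmin(dim=-1)                                 # (B, T)


class AutoregressiveLM(nn.Module):
    """Stage 2: causal next-token prediction over cochlear tokens."""

    def __init__(self, vocab=1024, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                               # tokens: (B, T)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        hidden = self.encoder(self.embed(tokens), mask=mask)
        return self.head(hidden)                             # (B, T, vocab)


wav = torch.randn(2, 16000)            # two clips of fake 16 kHz audio, 1 s
tokens = CochlearTokenizer()(wav)      # (2, 98) discrete cochlear tokens
logits = AutoregressiveLM()(tokens)    # (2, 98, 1024) next-token logits
```

In a real system the codebook would be learned (e.g., by k-means over cochleagram frames) rather than randomly initialized, and the front-end would implement an actual cochlear model rather than a plain convolution.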
📝 Abstract
We introduce AuriStream, a biologically inspired model for encoding speech via a two-stage framework inspired by the human auditory processing hierarchy. The first stage transforms raw audio into a time-frequency representation based on the human cochlea, from which we extract discrete **cochlear tokens**. The second stage applies an autoregressive sequence model over the cochlear tokens. AuriStream learns meaningful phoneme and word representations, as well as state-of-the-art lexical semantics, and shows competitive performance on diverse downstream SUPERB speech tasks. Complementing its strong representational capabilities, AuriStream generates continuations of audio that can be visualized in spectrogram space and decoded back into audio, providing insights into the model's predictions. In summary, we present a two-stage framework for speech representation learning to advance the development of more human-like models that efficiently handle a range of speech-based tasks.
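As an illustration of the continuation capability the abstract describes, the following sketch autoregressively extends a cochlear-token prompt using the `AutoregressiveLM` stand-in from the sketch above; temperature sampling and the decoding note are assumptions, not necessarily the paper's method.

```python
# Illustrative continuation loop over cochlear tokens, reusing the
# AutoregressiveLM sketch above; temperature sampling is an assumption,
# not necessarily the paper's decoding strategy.
import torch


@torch.no_grad()
def continue_tokens(lm, prompt, n_new=50, temperature=1.0):
    """Autoregressively extend a (B, T) token prompt by n_new steps."""
    tokens = prompt
    for _ in range(n_new):
        logits = lm(tokens)[:, -1] / temperature       # last-step logits
        next_tok = torch.multinomial(logits.softmax(-1), num_samples=1)
        tokens = torch.cat([tokens, next_tok], dim=1)  # append sampled token
    return tokens


# Usage: extended = continue_tokens(AutoregressiveLM(), tokens)
# The extended tokens could then be mapped back to spectrogram frames via the
# codebook embeddings and vocoded to a waveform for listening.
```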