dMel: Speech Tokenization made Simple

📅 2024-07-22
🏛️ arXiv.org
📈 Citations: 4
Influential: 0
🤖 AI Summary
Existing speech tokenization methods rely on neural audio compressors, which incur high computational overhead and generalize poorly across domains. This paper proposes dMel, a training-free, streaming-capable, and robust discrete speech representation obtained by binning the per-channel energies of log-Mel spectrograms into intensity levels, enabling lightweight tokenization. A key contribution is the unified modeling of text-to-speech (TTS) and automatic speech recognition (ASR) within a single LM-style Transformer architecture, using parallel encoding and decoding of high-dimensional tokens to balance efficiency and representational capacity. Experiments show that dMel matches or surpasses task-specific models on both synthesis and recognition while significantly reducing computational cost and deployment complexity. By eliminating auxiliary neural compressors and dedicated tokenizer training, dMel offers a simple, efficient, and scalable representation paradigm for speech foundation models.

📝 Abstract
Large language models have revolutionized natural language processing by leveraging self-supervised pretraining on vast textual data. Inspired by this success, researchers have investigated various compression-based speech tokenization methods to discretize continuous speech signals, enabling the application of language modeling techniques to discrete tokens. However, audio compressors introduce additional complexity and computational cost, and often fail on out-of-domain audio signals. In this work, we introduce a novel speech representation (dMel) that discretizes mel-filterbank channels into intensity bins, creating a simpler yet more effective representation compared to existing speech tokenization methods. Our approach demonstrates superior performance in preserving audio content, robustness to out-of-domain data, and offers a training-free, natural, and streamable representation. To address the high-dimensional nature of log-mel spectrograms, we propose an efficient parallel encoding and decoding method for high-dimensional tokens using an LM-style transformer architecture. This innovation enables us to develop RichTTS and RichASR, two models sharing the same architecture while achieving comparable or better results than specialized existing methods. Our results demonstrate the effectiveness of dMel in achieving high performance on both speech synthesis and recognition tasks within a unified framework, paving the way for efficient and effective joint modeling of speech and text.
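The tokenization step the abstract describes is essentially uniform quantization of each mel channel. A minimal sketch of this idea, assuming uniform bins over a fixed intensity range (the bin count and range-clipping choices here are illustrative, not the paper's exact settings):

```python
import numpy as np

def dmel_tokenize(log_mel, num_bins=16, lo=None, hi=None):
    """Discretize each mel-filterbank value into one of `num_bins`
    uniform intensity bins. Training-free: no learned codebook."""
    lo = log_mel.min() if lo is None else lo
    hi = log_mel.max() if hi is None else hi
    scaled = (log_mel - lo) / (hi - lo + 1e-8)          # normalize to [0, 1)
    tokens = (scaled * num_bins).astype(np.int64)       # uniform binning
    return np.clip(tokens, 0, num_bins - 1)

def dmel_detokenize(tokens, num_bins=16, lo=0.0, hi=1.0):
    """Lossy inverse: map each bin index back to its bin-center intensity."""
    centers = (tokens.astype(np.float64) + 0.5) / num_bins
    return lo + centers * (hi - lo)

# Toy example: a (frames x mel-channels) log-mel matrix.
log_mel = np.random.randn(100, 80)
tokens = dmel_tokenize(log_mel, num_bins=16)
recon = dmel_detokenize(tokens, num_bins=16,
                        lo=log_mel.min(), hi=log_mel.max())
```

Because each channel is quantized independently, the representation is naturally streamable: a frame can be tokenized as soon as it arrives, with no lookahead or neural encoder pass.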
Problem

Research questions and friction points this paper is trying to address.

Simplifying speech tokenization for effective language modeling
Improving robustness to out-of-domain audio signals
Enabling unified modeling for speech synthesis and recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discretizes mel-filterbank channels into intensity bins
Proposes efficient parallel encoding and decoding method
Develops RichTTS and RichASR with unified architecture
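Since each frame carries one token per mel channel (e.g. 80 of them), the model must handle high-dimensional tokens efficiently. One plausible reading of the parallel encoding/decoding idea, sketched here with random untrained weights and illustrative dimensions (per-channel embedding tables summed into one vector per frame, then one logit set per channel at the output):

```python
import numpy as np

rng = np.random.default_rng(0)
num_channels, num_bins, d_model = 80, 16, 64

# One embedding table per mel channel (assumed design; sizes illustrative).
emb = rng.standard_normal((num_channels, num_bins, d_model)) * 0.02

def encode_frame(frame_tokens):
    """Collapse one frame's per-channel tokens into a single d_model vector
    by summing channel embeddings, so the LM sees one position per frame."""
    return emb[np.arange(num_channels), frame_tokens].sum(axis=0)

def decode_frame(hidden):
    """Predict all channel tokens in parallel: one set of bin logits
    per channel, rather than num_channels sequential decoding steps."""
    logits = np.einsum('d,cbd->cb', hidden, emb)   # (channels, bins)
    return logits.argmax(axis=1)

frame_tokens = rng.integers(0, num_bins, size=num_channels)
h = encode_frame(frame_tokens)                     # one vector per frame
pred = decode_frame(h)                             # all channels at once
```

The payoff of this layout is that sequence length scales with the number of frames, not frames times channels, which keeps the transformer's attention cost independent of the per-frame token dimensionality.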