🤖 AI Summary
This work addresses the challenge of disentangling linguistic content from non-linguistic factors, such as speaker identity, in speech signals, where the two are tightly coupled and impede the extraction of clean semantic representations. The authors propose Kanade, a single-layer disentangled speech tokenizer that leverages an acoustic invariance mechanism to produce a unified token stream directly from raw audio. Without auxiliary supervision or complex architectural components, Kanade suppresses speaker-related variation while preserving rich phonetic and prosodic information. The method achieves state-of-the-art speaker disentanglement and lexical availability while maintaining high-fidelity speech reconstruction, demonstrating that simplicity and efficacy can coexist in self-supervised speech representation learning.
📝 Abstract
A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information like speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade separates out acoustic constants to create a single stream of tokens that captures rich phonetics and prosody. It does so without the need for auxiliary methods that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability, while maintaining excellent reconstruction quality.
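To make the "separating out acoustic constants" idea concrete, here is a minimal toy sketch (not the paper's implementation; the feature shapes, codebook, and mean-subtraction step are invented for illustration): if a speaker-like factor shows up as a constant offset shared by every frame of an utterance, removing the per-utterance mean before vector quantization yields a single token stream that is invariant to that offset.

```python
import numpy as np

rng = np.random.default_rng(0)

def tokenize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map (T, D) frame features to a (T,) stream of token ids.

    The utterance-level mean is subtracted first, so any constant
    offset shared by all frames (a stand-in for speaker identity)
    cannot leak into the tokens.
    """
    residual = frames - frames.mean(axis=0, keepdims=True)  # drop the constant
    # Nearest-neighbour lookup: squared distance to every codebook entry.
    d2 = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Two "utterances" with identical content but different constant offsets
# stand in for the same words spoken by two different speakers.
codebook = rng.normal(size=(16, 8))
content = rng.normal(size=(20, 8))
tokens_a = tokenize(content + 1.0, codebook)  # "speaker A" offset
tokens_b = tokenize(content - 2.5, codebook)  # "speaker B" offset
assert np.array_equal(tokens_a, tokens_b)     # tokens invariant to the offset
```

The actual mechanism in Kanade is of course richer than a mean subtraction, but the sketch shows the target property: the token stream depends on frame-to-frame (phonetic/prosodic) variation, not on utterance-level constants.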