Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of disentangling linguistic content from non-linguistic factors, such as speaker identity, which are tightly coupled in speech signals and make clean linguistic representations hard to extract. The authors propose Kanade, a single-layer disentangled speech tokenizer that separates out acoustic constants to produce a single token stream capturing rich phonetic and prosodic information. It does so without the auxiliary methods that existing disentangled codecs often rely on. The reported experiments show state-of-the-art speaker disentanglement and lexical availability while maintaining excellent reconstruction quality.

📝 Abstract
A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information like speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade separates out acoustic constants to create a single stream of tokens that captures rich phonetics and prosody. It does so without the need for auxiliary methods that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability, while maintaining excellent reconstruction quality.
Problem

Research questions and friction points this paper is trying to address.

speech tokenization
disentanglement
spoken language modeling
speaker identity
phonetics
Innovation

Methods, ideas, or system contributions that make the work stand out.

disentangled tokenizer
speech modeling
speaker disentanglement
prosody
lexical availability
Zhijie Huang
The University of Tokyo, Tokyo, Japan
Stephen McIntosh
The University of Tokyo, Tokyo, Japan
Daisuke Saito
The University of Tokyo, Tokyo, Japan
Nobuaki Minematsu
The University of Tokyo
Speech Communication, Foreign Language Learning