🤖 AI Summary
This study addresses the finding that existing discrete speech unit (DSU) representations preserve suprasegmental features, particularly lexical tone, less reliably than segmental structure under quantisation, a limitation that is especially pronounced in tonal languages such as Mandarin and Yorùbá. To mitigate this, the authors propose a two-stage K-means quantisation strategy: a first clustering pass encodes segmental information, and a second pass over the residual representations better retains tonal characteristics. Leveraging self-supervised speech representations and multiple quantisation approaches, the work systematically evaluates how well DSUs encode lexical tone. Experiments show that the two-stage approach retains tonal information markedly better than standard single-stage quantisation while maintaining segmental structure. These findings highlight inherent limitations of current DSU frameworks for tonal modelling and point to a viable path for improvement.
📝 Abstract
Discrete speech units (DSUs) are derived by quantising representations from models trained with self-supervised learning (SSL). They are a popular representation for a wide variety of spoken language tasks, including those where prosody matters, and they are especially convenient for tasks where text and speech are jointly modelled, such as text-to-speech and multimodal dialogue systems. However, we find that DSUs encode suprasegmental information less reliably than segmental structure. We demonstrate this using lexical tone, though the limitation likely extends to other aspects of prosody.
Our investigations of the tone languages Mandarin and Yorùbá show that the underlying SSL representations do encode tone, yet the DSUs obtained by quantising them prioritise phonetic structure, leaving lexical tone less reliably encoded. This holds across a variety of quantisation methods, not only the most common, K-means.
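One way to make this kind of comparison concrete is a lightweight probing classifier trained to predict tone labels from each representation. The sketch below is a hypothetical illustration using scikit-learn, not the paper's exact evaluation protocol; the array names (`ssl_feats`, `unit_ids`, `tone_labels`) and the simple train/test split are assumptions made for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

def probe_accuracy(X_train, y_train, X_test, y_test):
    """Linear probe: accuracy reflects how recoverable the tone
    label is from the given representation."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

def compare_representations(ssl_feats, unit_ids, tone_labels, split=0.8):
    """Probe continuous SSL features against one-hot encoded DSU
    indices on the same tone-classification task."""
    n = int(len(tone_labels) * split)
    # Continuous SSL features go into the probe directly.
    acc_ssl = probe_accuracy(ssl_feats[:n], tone_labels[:n],
                             ssl_feats[n:], tone_labels[n:])
    # Discrete unit IDs are one-hot encoded so the same linear probe
    # applies; tone lost during quantisation shows up as a drop in
    # accuracy relative to the continuous features.
    enc = OneHotEncoder(handle_unknown="ignore")
    dsu = enc.fit_transform(unit_ids.reshape(-1, 1))
    acc_dsu = probe_accuracy(dsu[:n], tone_labels[:n],
                             dsu[n:], tone_labels[n:])
    return acc_ssl, acc_dsu
```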
We conclude that current DSU quantisation strategies have limitations for suprasegmental features, which suggests a need for new, tone-aware (or prosody-aware) techniques in speech representation learning. We point towards one potential solution: performing K-means clustering once to encode phonetic information, then a second time on the residual representations, whose units better encode lexical tone.
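As a concrete illustration of that two-stage idea, here is a minimal sketch assuming scikit-learn's `KMeans`; the function name and the codebook sizes `k_phone` and `k_tone` are illustrative assumptions, not the paper's reported configuration.

```python
from sklearn.cluster import KMeans

def two_stage_units(features, k_phone=500, k_tone=50, seed=0):
    """Quantise SSL frame features (n_frames, dim) in two stages,
    returning (stage-1 unit IDs, stage-2 residual unit IDs)."""
    # Stage 1: ordinary K-means over the raw features; these units
    # tend to capture segmental (phonetic) structure.
    km1 = KMeans(n_clusters=k_phone, n_init=10, random_state=seed).fit(features)
    phone_units = km1.predict(features)

    # Residual: remove each frame's assigned centroid, leaving the
    # variation (e.g. pitch/tone cues) the first codebook discarded.
    residuals = features - km1.cluster_centers_[phone_units]

    # Stage 2: cluster the residuals with a second codebook to obtain
    # units that better reflect lexical tone.
    km2 = KMeans(n_clusters=k_tone, n_init=10, random_state=seed).fit(residuals)
    tone_units = km2.predict(residuals)

    return phone_units, tone_units
```

Viewed this way, the approach is essentially one step of residual vector quantisation realised with K-means codebooks.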