🤖 AI Summary
Discrete speech tokens are widely used in speech language models, yet how well they capture prosody, and how sensitive they are to prosodic features, has not been systematically examined.
Method: We extract continuous representations from a range of self-supervised learning (SSL) speech models, discretize them via k-means clustering, and run controlled prosodic perturbation experiments that quantify how sensitive the resulting tokens are to changes in pitch, duration, and energy, using classification and regression tasks. The result is a reproducible evaluation framework designed specifically for the prosody sensitivity of discrete tokens.
Results: Our empirical analysis shows that both the choice of SSL model and the k-means hyperparameters (such as the number of clusters) strongly affect how faithfully prosody is encoded, and it exposes limitations of current discretization approaches. We provide empirically grounded configuration guidelines for preserving prosody. This work offers a practical methodology for designing and evaluating prosody-aware discrete tokens in speech language models.
📝 Abstract
Recently, discrete tokens derived from self-supervised learning (SSL) models via k-means clustering have been actively studied as pseudo-text in speech language models and as efficient intermediate representations for various tasks. However, these discrete tokens are typically learned in advance, separately from the training of language models or downstream tasks. As a result, choices related to discretization, such as the SSL model used or the number of clusters, must be made heuristically. In particular, speech language models are expected to understand and generate responses that reflect not only semantic content but also prosodic features. Yet there has been limited research on the ability of discrete tokens to capture prosodic information. To address this gap, this study conducts a comprehensive analysis of prosodic encoding in discrete tokens, measured by their sensitivity to artificially modified prosody, with the aim of providing practical guidelines for designing discrete tokens.
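The discretize-then-perturb idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the frames are synthetic 2-D points standing in for SSL features, the second dimension is assumed to behave like a pitch-related component, the perturbation is a simple shift of that dimension, and sensitivity is measured with a plain token change rate rather than the classification and regression tasks used in the study. All function names (`kmeans`, `assign`, `token_change_rate`, `demo`) are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(frames, centroids):
    """Map each frame to the index of its nearest centroid (its discrete token)."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(f, centroids[j]))
        for f in frames
    ]


def kmeans(frames, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(frames, k)  # initialize from k distinct frames
    for _ in range(iters):
        tokens = assign(frames, centroids)
        for j in range(k):
            members = [f for f, t in zip(frames, tokens) if t == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids


def token_change_rate(tokens_a, tokens_b):
    """Fraction of frames whose token differs between two aligned sequences."""
    if not tokens_a or len(tokens_a) != len(tokens_b):
        raise ValueError("token sequences must be non-empty and equally long")
    return sum(a != b for a, b in zip(tokens_a, tokens_b)) / len(tokens_a)


def demo(shift, k=8, n_frames=200, seed=0):
    """Token change rate after shifting the assumed pitch-like dimension."""
    rng = random.Random(seed)
    # Synthetic stand-in for SSL features: (content-like, pitch-like) per frame.
    frames = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_frames)]
    centroids = kmeans(frames, k, seed=seed)
    original = assign(frames, centroids)
    perturbed = assign([(c, p + shift) for c, p in frames], centroids)
    return token_change_rate(original, perturbed)


if __name__ == "__main__":
    for k in (2, 8, 32):
        print(k, demo(shift=0.5, k=k))
```

In a real setup the perturbed audio would be re-encoded by the SSL model before token assignment, and duration changes would alter the frame count, so a frame-by-frame comparison like this one would need an alignment step. Varying `k` in the loop mirrors the question of how the number of clusters affects prosody sensitivity.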