Differentiable K-means for Fully-optimized Discrete Token-based ASR

📅 2025-05-22

📈 Citations: 0

✨ Influential: 0

career value

186K/year

🤖 AI Summary

Discrete speech tokens extracted from existing self-supervised learning (SSL) models are typically obtained via downstream-task-agnostic k-means clustering, yielding suboptimal representations for automatic speech recognition (ASR). Method: We propose differentiable k-means—a novel method that integrates k-means clustering into an end-to-end differentiable training framework for the first time—enabling joint optimization of SSL feature clustering and ASR. Our approach jointly learns model parameters (e.g., in wav2vec 2.0), layer-wise feature weights, and cluster centroids, while incorporating a multi-layer feature weighting and fusion mechanism. Contribution/Results: Experiments demonstrate substantial ASR performance gains, higher phoneme purity in learned tokens, and strong cross-task generalization—e.g., in speech resynthesis. This work establishes the first differentiable, trainable paradigm for discrete speech token generation, effectively bridging the gap between SSL representation learning and downstream discrete-token adaptation.

Technology Category

Application Category

📝 Abstract

Recent studies have highlighted the potential of discrete tokens derived from self-supervised learning (SSL) models for various speech-related tasks. These tokens serve not only as substitutes for text in language modeling but also as intermediate representations for tasks such as automatic speech recognition (ASR). However, discrete tokens are typically obtained via k-means clustering of SSL features independently of downstream tasks, making them suboptimal for specific applications. This paper proposes the use of differentiable k-means, enabling the joint optimization of tokenization and downstream tasks. This approach enables the fine-tuning of the SSL parameters and learning weights for outputs from multiple SSL layers. Experiments were conducted with ASR as a downstream task. ASR accuracy successfully improved owing to the optimized tokens. The acquired tokens also exhibited greater purity of phonetic information, which were found to be useful even in speech resynthesis.

Problem

Research questions and friction points this paper is trying to address.

Optimizing discrete tokens for ASR using differentiable k-means

Jointly fine-tuning SSL features and downstream task parameters

Improving token purity for phonetic information and speech tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Differentiable k-means for joint optimization

Fine-tuning SSL parameters and layer weights

Improved ASR accuracy with optimized tokens

🔎 Similar Papers

No similar papers found.