Frontend Token Enhancement for Token-Based Speech Recognition

📅 2026-02-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the performance degradation of discrete speech representations, such as semantic or phonetic tokens, in noisy environments, which limits automatic speech recognition (ASR) systems that rely on them. The authors propose a frontend enhancement system that estimates clean discrete speech tokens directly from noisy input and is trained independently of the ASR backend. The study compares four enhancement paradigms, defined by their input/output domains: waveform-to-waveform, token-to-token, continuous self-supervised learning (SSL) features-to-token, and waveform-to-token. On the CHiME-4 dataset, the waveform-to-token approach achieves the best performance among the four frontends and, in most conditions, also outperforms an ASR system based on continuous SSL features.

📝 Abstract

Discretized representations of speech signals are efficient alternatives to continuous features for various speech applications, including automatic speech recognition (ASR) and speech language models. However, these representations, such as semantic or phonetic tokens derived from clustering outputs of self-supervised learning (SSL) speech models, are susceptible to environmental noise, which can degrade backend task performance. In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/output domains: wave-to-wave, token-to-token, continuous SSL features-to-token, and wave-to-token. These models are trained independently of ASR backends. Experiments on the CHiME-4 dataset demonstrate that wave-to-token enhancement achieves the best performance among the frontends. Moreover, it mostly outperforms the ASR system based on continuous SSL features.
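To make the tokenization step concrete, the following is a minimal sketch of how clustering-based tokens can be obtained and why noise changes them. It is not the authors' pipeline: the 2-D vectors are toy stand-ins for SSL feature frames, the hand-picked `centroids` stand in for a learned k-means codebook, and the Gaussian perturbation is applied in feature space as a simplified proxy for environmental noise (in practice noise enters at the waveform). All function names are hypothetical.

```python
import random


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid (its token)."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every centroid.
        dists = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        tokens.append(dists.index(min(dists)))
    return tokens


def token_mismatch_rate(ref_tokens, hyp_tokens):
    """Fraction of frame positions whose token differs from the reference."""
    if len(ref_tokens) != len(hyp_tokens):
        raise ValueError("token sequences must have the same length")
    mismatches = sum(r != h for r, h in zip(ref_tokens, hyp_tokens))
    return mismatches / len(ref_tokens)


def perturb(frames, scale, rng):
    """Add Gaussian noise to every feature dimension (toy proxy for noisy speech)."""
    return [[x + rng.gauss(0.0, scale) for x in frame] for frame in frames]


# Toy codebook with four clusters (assumed; a real one would come from k-means).
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Toy "clean" feature frames, one per time step.
clean_frames = [[0.1, 0.0], [0.9, 0.1], [0.0, 0.8], [1.0, 0.9], [0.2, 0.1]]

clean_tokens = quantize(clean_frames, centroids)

rng = random.Random(0)
noisy_tokens = quantize(perturb(clean_frames, 0.5, rng), centroids)

print("clean tokens:", clean_tokens)
print("noisy tokens:", noisy_tokens)
print("mismatch rate:", token_mismatch_rate(clean_tokens, noisy_tokens))
```

In this framing, a token-estimating frontend (token-to-token, SSL features-to-token, or wave-to-token) is trained so that its output sequence matches `clean_tokens` even though its input is noisy, whereas a wave-to-wave frontend cleans the signal first and leaves tokenization unchanged.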

Problem

Research questions and friction points this paper is trying to address.

discretized speech representations

environmental noise

token-based ASR

speech token degradation

Innovation

Methods, ideas, or system contributions that make the work stand out.

token-based speech recognition

frontend enhancement

wave-to-token

self-supervised learning