A Comprehensive Analysis of Tokenization and Self-Supervised Learning in End-to-End Automatic Speech Recognition applied on French Language

📅 2026-05-05

🤖 AI Summary

This study addresses the limitations of character error rate (CER) and word error rate (WER) as the sole criteria for evaluating end-to-end automatic speech recognition (ASR) systems, since these metrics often fail to capture the full spectrum of transcription quality. Focusing on French, the work proposes a multidimensional evaluation framework that combines linguistic and acoustic perspectives. It systematically compares subword tokenization strategies, such as Byte Pair Encoding (BPE), and prominent self-supervised speech representation models within end-to-end ASR architectures, and shows how these components influence transcription accuracy and fluency. The resulting framework offers a more comprehensive, application-oriented approach to ASR evaluation and provides an empirical basis for optimizing downstream French ASR systems.
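To make the tokenization side concrete, the following is a minimal sketch of the textbook BPE merge-learning procedure on a toy word list. It is not the paper's implementation: the summary does not say which BPE variant, vocabulary size, or toolkit was used, so the function names (`learn_bpe`, `merge_pair`, `apply_bpe`) and the `</w>` end-of-word marker are illustrative assumptions.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-separated corpus."""
    # Each word starts as a tuple of characters plus a hypothetical end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Calling `learn_bpe("parler parlons parlez", 5)` and then `apply_bpe("parlait", merges)` shows how the number of merges controls the granularity of the subword units, which is the kind of hyperparameter choice the study compares.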

📝 Abstract

The performance of end-to-end automatic speech recognition (ASR) systems enables their increasing integration into numerous applications. While there are various benefits to such speech-to-text systems, the choice of hyperparameters and models plays a crucial role in their performance. Typically, these choices are determined by considering only the character error rate (CER) and/or word error rate (WER). However, several studies have shown that these metrics are largely incomplete and fail to adequately describe how automatic transcripts perform in downstream applications. In this paper, we conduct a qualitative study on the French language that investigates the impact of subword tokenization algorithms and self-supervised learning models from different linguistic and acoustic perspectives, using a comprehensive set of evaluation metrics.
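For readers unfamiliar with the two baseline metrics, here is a minimal sketch of how WER and CER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names and the French example sentence are illustrative assumptions, not taken from the paper, and no text normalization is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences; each edit costs 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example: two agreement errors that are typically silent in spoken French.
reference = "elles sont parties hier"
hypothesis = "elle sont parti hier"
print(f"WER = {wer(reference, hypothesis):.3f}")  # 2 wrong words out of 4
print(f"CER = {cer(reference, hypothesis):.3f}")  # 3 deleted characters out of 23
```

In this example WER is 0.5 while CER is about 0.13, and neither number indicates that both errors are grammatical agreement mistakes with little or no acoustic difference. That gap is the kind of information the abstract says single error-rate metrics leave out.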

Problem

Research questions and friction points this paper is trying to address.

automatic speech recognition

tokenization

self-supervised learning

evaluation metrics

French language

Innovation

Methods, ideas, or system contributions that make the work stand out.

subword tokenization

self-supervised learning

end-to-end ASR