🤖 AI Summary
This work investigates how unpaired data can be effectively utilized in unsupervised speech recognition, establishing for the first time the theoretical conditions under which such learning is feasible and deriving a provable upper bound on the classification error. Building on this theoretical framework, the authors propose a single-stage training method based on a sequence-level cross-entropy loss, which optimizes the sequence-level objective directly rather than through multiple training stages. Through theoretical analysis, derivation of the error bound, and simulation experiments, they validate the proposed bound and demonstrate that the new loss function improves model performance in unsupervised settings. This study thus provides both a theoretical foundation and a practical training methodology for advancing unsupervised speech recognition.
📝 Abstract
Unsupervised speech recognition is the task of training a speech recognition model with unpaired data. To determine when and how unsupervised speech recognition can succeed, and how classification error relates to candidate training objectives, we develop a theoretical framework for unsupervised speech recognition grounded in classification error bounds. We introduce two conditions under which unsupervised speech recognition is possible, and we also discuss the necessity of these conditions. Under these conditions, we derive a classification error bound for unsupervised speech recognition and validate this bound in simulations. Motivated by this bound, we propose a single-stage sequence-level cross-entropy loss for unsupervised speech recognition.
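The abstract does not spell out the exact form of the proposed loss. As an illustration only, one common way to write a sequence-level cross-entropy in the unpaired setting is the cross-entropy between an empirical distribution over text sequences $p(y)$ and the model's distribution over output sequences $q(y)$, i.e. $-\sum_y p(y)\log q(y)$. The sketch below assumes a toy model whose sequence probability factorizes over per-step token distributions; all names (`step_probs`, `text_dist`) are hypothetical, not from the paper:

```python
import math

def sequence_log_prob(step_probs, seq):
    """Log-probability of token sequence `seq` under a model that
    factorizes as a product of per-step token distributions.
    step_probs[t][k] = model probability of token k at step t."""
    return sum(math.log(step_probs[t][tok]) for t, tok in enumerate(seq))

def sequence_level_ce(step_probs, text_dist):
    """Sequence-level cross-entropy -sum_y p(y) log q(y), where p is an
    empirical distribution over unpaired text sequences (text_dist maps
    a token tuple to its probability) and q is the model distribution."""
    return -sum(p * sequence_log_prob(step_probs, seq)
                for seq, p in text_dist.items())

# Toy example: binary vocabulary, length-2 sequences.
step_probs = [[0.9, 0.1],   # step 0: P(token=0)=0.9, P(token=1)=0.1
              [0.2, 0.8]]   # step 1: P(token=0)=0.2, P(token=1)=0.8
text_dist = {(0, 1): 1.0}   # all text mass on the sequence (0, 1)

loss = sequence_level_ce(step_probs, text_dist)
print(loss)  # -log(0.9 * 0.8) = -log(0.72) ≈ 0.3285
```

Because this operates on whole-sequence probabilities rather than per-frame labels, it can be minimized with unpaired speech and text, which is the setting the paper targets; the paper's actual loss may differ in parameterization and normalization.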