🤖 AI Summary
This work investigates how unpaired data can be effectively utilized in unsupervised speech recognition, establishing for the first time the theoretical conditions under which such learning is feasible and deriving a provable upper bound on the classification error. Building on this theoretical framework, the authors propose a single-stage training method based on a sequence-level cross-entropy loss, which optimizes the sequence-level objective directly rather than through multiple training stages. Through theoretical analysis, derivation of the error bound, and simulation experiments, they validate the proposed bound and demonstrate that the new loss function improves model performance in unsupervised settings. This study thus provides both a theoretical foundation and a practical training methodology for advancing unsupervised speech recognition.
📝 Abstract
Unsupervised speech recognition is the task of training a speech recognition model with unpaired data. To determine when and how unsupervised speech recognition can succeed, and how classification error relates to candidate training objectives, we develop a theoretical framework for unsupervised speech recognition grounded in classification error bounds. We introduce two conditions under which unsupervised speech recognition is possible, and we also discuss the necessity of these conditions. Under these conditions, we derive a classification error bound for unsupervised speech recognition and validate this bound in simulations. Motivated by this bound, we propose a single-stage sequence-level cross-entropy loss for unsupervised speech recognition.
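The abstract does not spell out the exact form of the proposed loss. As an illustration only, one common way to write a sequence-level cross-entropy in the unpaired setting is the cross-entropy between an empirical distribution over text sequences $p(y)$ and the model's distribution over output sequences $q(y)$, i.e. $-\sum_y p(y)\log q(y)$. The sketch below assumes a toy model whose sequence probability factorizes over per-step token distributions; all names (`step_probs`, `text_dist`) are hypothetical, not from the paper:

```python
import math

def sequence_log_prob(step_probs, seq):
    """Log-probability of token sequence `seq` under a model that
    factorizes as a product of per-step token distributions.
    step_probs[t][k] = model probability of token k at step t."""
    return sum(math.log(step_probs[t][tok]) for t, tok in enumerate(seq))

def sequence_level_ce(step_probs, text_dist):
    """Sequence-level cross-entropy -sum_y p(y) log q(y), where p is an
    empirical distribution over unpaired text sequences (text_dist maps
    a token tuple to its probability) and q is the model distribution."""
    return -sum(p * sequence_log_prob(step_probs, seq)
                for seq, p in text_dist.items())

# Toy example: binary vocabulary, length-2 sequences.
step_probs = [[0.9, 0.1],   # step 0: P(token=0)=0.9, P(token=1)=0.1
              [0.2, 0.8]]   # step 1: P(token=0)=0.2, P(token=1)=0.8
text_dist = {(0, 1): 1.0}   # all text mass on the sequence (0, 1)

loss = sequence_level_ce(step_probs, text_dist)
print(loss)  # -log(0.9 * 0.8) = -log(0.72) ≈ 0.3285
```

Because this operates on whole-sequence probabilities rather than per-frame labels, it can be minimized with unpaired speech and text, which is the setting the paper targets; the paper's actual loss may differ in parameterization and normalization.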