BiRQ: Bi-Level Self-Labeling Random Quantization for Self-Supervised Speech Recognition

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing self-supervised speech recognition methods face a trade-off: high-performing approaches like HuBERT rely on external encoders and involve complex pipelines, while efficient alternatives such as BEST-RQ suffer from low-quality pseudo-labels. This work proposes BiRQ, the first framework to integrate Random Projection Quantization (RPQ) with a dual-level self-labeling mechanism that leverages the model’s internal intermediate representations for hierarchical pseudo-label refinement, eliminating external encoders entirely and enabling end-to-end differentiable training and iterative optimization. Key innovations include Gumbel-Softmax-based differentiable discrete selection and first-order gradient-based bilevel optimization. Evaluated on LibriSpeech, AMI, and YODAS, BiRQ consistently outperforms BEST-RQ, achieving substantial ASR accuracy gains while maintaining low computational overhead. Results demonstrate BiRQ’s effectiveness, training stability, and scalability across diverse domains and dataset sizes.
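The random-projection quantizer the summary refers to (as in BEST-RQ) maps each feature frame through a fixed random projection and assigns it the index of the nearest codebook entry as a pseudo-label. A minimal NumPy sketch, with all dimensions and names purely illustrative (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection_quantize(features, proj, codebook):
    """BEST-RQ-style labeling sketch: project frames with a frozen random
    matrix, then assign each frame the index of the nearest codebook row."""
    projected = features @ proj  # (T, d_code)
    # L2-normalize before nearest-neighbor lookup, a common stabilizer
    projected = projected / (np.linalg.norm(projected, axis=-1, keepdims=True) + 1e-8)
    codes = codebook / (np.linalg.norm(codebook, axis=-1, keepdims=True) + 1e-8)
    dists = ((projected[:, None, :] - codes[None, :, :]) ** 2).sum(-1)  # (T, V)
    return dists.argmin(axis=-1)  # one pseudo-label per frame

# frozen, randomly initialized projection and codebook (illustrative sizes)
T, d_in, d_code, V = 50, 80, 16, 1024
proj = rng.standard_normal((d_in, d_code))
codebook = rng.standard_normal((V, d_code))
labels = random_projection_quantize(rng.standard_normal((T, d_in)), proj, codebook)
```

Because the projection and codebook stay frozen, the labeler needs no training of its own, which is the efficiency BEST-RQ trades against label quality.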

📝 Abstract
Speech is a rich signal, and labeled audio-text pairs are costly, making self-supervised learning essential for scalable representation learning. A core challenge in speech SSL is generating pseudo-labels that are both informative and efficient: strong labels, such as those used in HuBERT, improve downstream performance but rely on external encoders and multi-stage pipelines, while efficient methods like BEST-RQ achieve simplicity at the cost of weaker labels. We propose BiRQ, a bilevel SSL framework that combines the efficiency of BEST-RQ with the refinement benefits of HuBERT-style label enhancement. The key idea is to reuse part of the model itself as a pseudo-label generator: intermediate representations are discretized by a random-projection quantizer to produce enhanced labels, while anchoring labels derived directly from the raw input stabilize training and prevent collapse. Training is formulated as an efficient first-order bilevel optimization problem, solved end-to-end with differentiable Gumbel-softmax selection. This design eliminates the need for external label encoders, reduces memory cost, and enables iterative label refinement in an end-to-end fashion. BiRQ consistently improves over BEST-RQ while maintaining low complexity and computational efficiency. We validate our method on various datasets, including 960-hour LibriSpeech, 150-hour AMI meetings and 5,000-hour YODAS, demonstrating consistent gains over BEST-RQ.
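The "differentiable Gumbel-softmax selection" mentioned in the abstract replaces a hard argmax over discrete choices (e.g. which intermediate layer supplies pseudo-labels) with a noisy, temperature-controlled softmax, so gradients can flow through the selection. A generic sketch of the standard Gumbel-softmax trick, assuming nothing about BiRQ's actual selection variables:

```python
import numpy as np

rng = np.random.default_rng(1)

def gumbel_softmax(logits, tau=1.0):
    """Differentiable approximate sample from a categorical distribution:
    perturb logits with Gumbel noise, then apply a tempered softmax."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# hypothetical example: softly choosing among three candidate layers
layer_logits = np.array([0.2, 1.5, 0.3])
weights = gumbel_softmax(layer_logits, tau=0.5)  # soft one-hot over layers
```

As the temperature `tau` is annealed toward zero, the soft weights approach a one-hot selection while remaining differentiable at finite temperatures.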
Problem

Research questions and friction points this paper is trying to address.

Self-supervised speech recognition without external label encoders
Combining efficient quantization with refined pseudo-label generation
Eliminating multi-stage pipelines while maintaining low computational complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilevel SSL framework with self-labeling refinement
Reuses the model's own intermediate representations as a pseudo-label generator
Trains end-to-end via efficient first-order bilevel optimization
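The first-order bilevel optimization named above can be illustrated on a toy problem: an inner variable is updated on an inner loss that depends on an outer variable, and the outer variable then takes a gradient step on the outer loss at the current inner solution, ignoring second-order terms. This is a generic alternating-update sketch under those assumptions, not BiRQ's actual objective:

```python
# Toy first-order bilevel scheme.
# Inner loss:  0.5 * (w - lam)^2   (w tracks the outer variable lam)
# Outer loss:  0.5 * (w - 3)^2     (we want the inner solution to reach 3)
w, lam = 0.0, 0.0
lr = 0.1
for _ in range(500):
    # inner step: gradient descent on the inner loss w.r.t. w
    w -= lr * (w - lam)
    # first-order outer step: use the outer-loss gradient at the current
    # inner solution as a cheap hypergradient proxy (no second-order terms)
    lam -= lr * (w - 3.0)
```

Both variables converge to the outer optimum (w ≈ lam ≈ 3) without ever computing the exact hypergradient, which is the efficiency argument for first-order bilevel methods.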
Liuyuan Jiang
University of Rochester
Xiaodong Cui
IBM Research
Brian Kingsbury
Distinguished Research Staff Member and Manager, IBM T. J. Watson Research Center, Yorktown Heights
Automatic Speech Recognition · Spoken Term Detection · Deep Learning
Tianyi Chen
Cornell University
Lisha Chen
University of Rochester