Pisets: A Robust Speech Recognition System for Lectures and Interviews

📅 2026-01-26
🏛️ North American Chapter of the Association for Computational Linguistics
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the susceptibility of Whisper to transcription errors and hallucinations in long-form audio scenarios such as lectures and interviews. To mitigate these issues, the authors propose a three-stage cascaded architecture: an initial transcription is generated using Wav2Vec2, followed by false-positive filtering via an Audio Spectrogram Transformer (AST), and finally refined output production by Whisper. The approach integrates uncertainty modeling and curriculum learning, and is trained on diverse Russian-language speech corpora. Experimental results demonstrate that the proposed system significantly outperforms both Whisper and WhisperX across varying acoustic conditions, yielding substantial improvements in accuracy and robustness for long-form audio transcription. The implementation has been made publicly available.
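The three-stage cascade described above can be sketched in miniature. The snippet below is an illustrative skeleton only: the `Segment` fields, threshold value, and toy stand-ins for the models are assumptions, whereas the real system uses trained Wav2Vec2 (primary recognition), AST (speech/non-speech filtering), and Whisper (final refinement) models.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    start: float        # segment start time, seconds
    end: float          # segment end time, seconds
    text: str           # draft transcript from the primary recognizer (e.g. Wav2Vec2)
    speech_prob: float  # classifier score that the segment is real speech (e.g. from AST)

def cascade_transcribe(
    segments: List[Segment],
    refine: Callable[[Segment], str],
    speech_threshold: float = 0.5,  # illustrative cutoff, not from the paper
) -> str:
    """Three-stage cascade sketch: keep only segments the audio classifier
    judges to be speech (dropping false positives such as hallucinations on
    silence or noise), then pass survivors to the final recognizer."""
    kept = [s for s in segments if s.speech_prob >= speech_threshold]
    return " ".join(refine(s) for s in kept)

# Toy stand-ins for real model outputs (hypothetical values):
drafts = [
    Segment(0.0, 2.0, "hello wrld", 0.97),   # real speech, noisy draft
    Segment(2.0, 3.0, "thank you", 0.10),    # hallucination on silence -> filtered out
    Segment(3.0, 5.0, "good mornin", 0.92),  # real speech, noisy draft
]
# Stand-in for the final Whisper pass: a simple lookup correction.
fixups = {"hello wrld": "hello world", "good mornin": "good morning"}
result = cascade_transcribe(drafts, refine=lambda s: fixups.get(s.text, s.text))
print(result)  # hallucinated middle segment is gone, drafts are refined
```

The design point the sketch captures is that filtering happens *between* the two recognizers, so the expensive final model never sees non-speech segments that trigger hallucinations.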

📝 Abstract
This work presents a speech-to-text system, "Pisets", for scientists and journalists which is based on a three-component architecture aimed at improving speech recognition accuracy while minimizing errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. The implementation of curriculum learning methods and the utilization of diverse Russian-language speech corpora significantly enhanced the system's effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches ensure robust transcribing of long audio data across various acoustic conditions compared to WhisperX and the usual Whisper model. The source code of the "Pisets" system is publicly available at GitHub: https://github.com/bond005/pisets.
Problem

Research questions and friction points this paper is trying to address.

speech recognition
hallucinations
robustness
long audio transcription
acoustic conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-component architecture
false positive filtering
curriculum learning
uncertainty modeling
robust speech recognition
Ivan Bondarenko
Researcher, Laboratory of Applied Digital Technologies, Novosibirsk State University
Deep Learning, Natural Language Processing, Automatic Speech Recognition, Automated Machine Learning, Few-Shot Learning
Daniil Grebenkin
Novosibirsk State University, Siberian Neuronets LLC
Oleg Sedukhin
Siberian Neuronets LLC
Mikhail Klementev
Novosibirsk State University, Siberian Neuronets LLC
Roman Derunets
Novosibirsk State University, Siberian Neuronets LLC
Lyudmila Budneva
Novosibirsk State University