Incorporating Error Level Noise Embedding for Improving LLM-Assisted Robustness in Persian Speech Recognition

📅 2025-12-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the significant degradation in automatic speech recognition (ASR) performance for low-resource languages like Persian under noisy conditions, this paper proposes a noise-aware multi-hypothesis collaborative error correction framework. Methodologically, it generates a 5-best hypothesis list using an enhanced Whisper-large model; introduces, for the first time, Error-Level Noise (ELN) embeddings to quantify language uncertainty induced by noise at both semantic and token levels; and jointly injects ELN into sentence-level and token-level large language models (LLMs), specifically a fine-tuned LLaMA-2-7B, to enable confidence-aware correction conditioned on noise characteristics. Evaluated on a mixed-noise Persian test set, the framework reduces word error rate (WER) from 31.10% (baseline Whisper) to 24.84%, substantially outperforming the ELN-free baseline (30.79%). The core contributions are the novel ELN modeling paradigm and its first application in multi-granularity LLM-based error correction.
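The token-level disagreement signal that ELN is described as capturing can be illustrated with a toy sketch. A minimal, assumption-laden version: whitespace tokenization, position-wise majority voting over the 5-best list, and `1 - majority fraction` as the per-position disagreement score (the paper's actual ELN computation is not detailed in this summary):

```python
from collections import Counter

def token_disagreement(hypotheses):
    """Per-position disagreement across an n-best hypothesis list.

    Toy stand-in for a token-level noise/uncertainty signal: for each
    token position, 1 - (fraction of hypotheses agreeing with the
    majority token). Shorter hypotheses are padded with a sentinel.
    """
    tokenized = [h.split() for h in hypotheses]
    max_len = max(len(t) for t in tokenized)
    rows = [t + ["<pad>"] * (max_len - len(t)) for t in tokenized]
    scores = []
    for pos in range(max_len):
        counts = Counter(row[pos] for row in rows)
        majority = counts.most_common(1)[0][1]
        scores.append(1.0 - majority / len(rows))
    return scores

# 5-best list that disagrees only on the second token
nbest = ["the cat sat", "the cat sat", "the bat sat",
         "the cat sat", "the hat sat"]
print(token_disagreement(nbest))  # [0.0, 0.4, 0.0]
```

Positions where the hypotheses agree score 0; the contested second position scores 0.4, flagging it as the most likely noise-induced error for the corrector to focus on.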

๐Ÿ“ Abstract
Automatic Speech Recognition (ASR) systems suffer significant performance degradation in noisy environments, a challenge that is especially severe for low-resource languages such as Persian. Even state-of-the-art models such as Whisper struggle to maintain accuracy under varying signal-to-noise ratios (SNRs). This study presents a robust noise-sensitive ASR error correction framework that combines multiple hypotheses with noise-aware modeling. Using noisy Persian speech, we generate 5-best hypotheses from a modified Whisper-large decoder. Error Level Noise (ELN) is introduced as a representation that captures semantic- and token-level disagreement across hypotheses, quantifying the linguistic distortions caused by noise. ELN thus provides a direct measure of noise-induced uncertainty, enabling the LLM to reason about the reliability of each hypothesis during correction. Three models are evaluated: (1) a base LLaMA-2-7B model without fine-tuning, (2) a fine-tuned variant trained on text-only hypotheses, and (3) a noise-conditioned model integrating ELN embeddings at both sentence and word levels. Experimental results demonstrate that the ELN-conditioned model achieves substantial reductions in Word Error Rate (WER). Specifically, on the challenging Mixed Noise test set, the proposed Fine-tuned + ELN (Ours) model reduces the WER from a baseline of 31.10% (Raw Whisper) to 24.84%, significantly surpassing the Fine-tuned (No ELN) text-only baseline of 30.79%. By contrast, the original LLaMA-2-7B model increases the WER to 64.58%, demonstrating that it is unable to correct Persian errors on its own. This confirms the effectiveness of combining multiple hypotheses with noise-aware embeddings for robust Persian ASR in noisy real-world scenarios.
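The WER figures quoted above follow the standard definition: word-level Levenshtein distance between hypothesis and reference, divided by the reference length. A minimal implementation (plain whitespace tokenization, no text normalization, so it will not reproduce the paper's Persian-specific scoring exactly):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# one substitution + one deletion against a 6-word reference -> 2/6
print(wer("the cat sat on the mat", "the bat sat on mat"))
```

A correction model "reducing WER from 31.10% to 24.84%" means it repairs roughly one in five of the errors Whisper makes under mixed noise, by this metric.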
Problem

Research questions and friction points this paper is trying to address.

Improves Persian speech recognition in noisy environments
Reduces Word Error Rate using noise-aware error correction
Combines multiple hypotheses with Error Level Noise embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Error Level Noise embedding for noise-aware modeling
Combines multiple hypotheses from modified Whisper decoder
Integrates ELN embeddings at both sentence and word levels
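Injecting ELN at both granularities could be realized in several ways; the sketch below assumes a simple additive fusion in embedding space (the paper's actual mechanism, e.g. learned projections or concatenation, may differ, and `inject_eln` is a hypothetical helper, not the authors' code):

```python
import numpy as np

def inject_eln(token_embeddings, eln_sentence, eln_tokens):
    """Condition token embeddings on sentence- and token-level noise signals.

    Assumed fusion: broadcast-add a sentence-level ELN vector over all
    positions, and scale a fixed direction by each scalar token-level
    ELN score (a learned projection would replace this in practice).
    """
    seq_len, dim = token_embeddings.shape
    # sentence-level: one noise vector shared by every position
    sent = np.broadcast_to(eln_sentence, (seq_len, dim))
    # token-level: per-position scalar score mapped into embedding space
    direction = np.ones(dim) / np.sqrt(dim)
    tok = np.outer(eln_tokens, direction)
    return token_embeddings + sent + tok

emb = np.zeros((3, 4))                       # 3 tokens, dim 4
out = inject_eln(emb, np.full(4, 0.1), np.array([0.0, 0.4, 0.0]))
print(out.shape)  # (3, 4)
```

The key design point carried over from the paper is that the noise signal enters at two granularities, so the LLM can down-weight an entire noisy utterance and also distrust specific contested tokens.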
Zahra Rahmani
Department of Computer Engineering, Sharif University of Technology
Hossein Sameti
Associate Professor, Sharif University of Technology
Speech Recognition and Synthesis; Spoken Dialogue Systems