Joint Beamforming and Speaker-Attributed ASR for Real Distant-Microphone Meeting Transcription

📅 2024-10-29
🏛️ arXiv.org
🤖 AI Summary
In far-field multi-speaker meeting transcription, existing end-to-end speaker-attributed automatic speech recognition (SA-ASR) systems suffer from limited robustness because they cannot suppress multichannel noise and reverberation. To address this, the authors propose an end-to-end, jointly optimized framework integrating neural beamforming with SA-ASR: (1) the first fully neural beamformer jointly trained with an SA-ASR model; and (2) a data alignment and augmentation method that enables beamformer pretraining on real meeting data. Evaluated on the AMI corpus, joint fine-tuning of the fully neural beamforming front-end with SA-ASR reduces word error rate (WER) by 9% relative over SA-ASR fine-tuning alone, while fine-tuning SA-ASR on a fixed beamformer's output yields an 8% relative WER reduction. This work improves the robustness and accuracy of end-to-end SA-ASR in challenging far-field, overlapping-speech scenarios.

πŸ“ Abstract
Distant-microphone meeting transcription is a challenging task. State-of-the-art end-to-end speaker-attributed automatic speech recognition (SA-ASR) architectures lack a multichannel noise and reverberation reduction front-end, which limits their performance. In this paper, we introduce a joint beamforming and SA-ASR approach for real meeting transcription. We first describe a data alignment and augmentation method to pretrain a neural beamformer on real meeting data. We then compare fixed, hybrid, and fully neural beamformers as front-ends to the SA-ASR model. Finally, we jointly optimize the fully neural beamformer and the SA-ASR model. Experiments on the real AMI corpus show that, while state-of-the-art multi-frame cross-channel attention based channel fusion fails to improve ASR performance, fine-tuning SA-ASR on the fixed beamformer's output and jointly fine-tuning SA-ASR with the neural beamformer reduce the word error rate by 8% and 9% relative, respectively.
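The abstract compares fixed, hybrid, and fully neural beamformers as front-ends, but the summary does not specify which fixed beamformer is used. As a minimal sketch of the idea, the following delay-and-sum beamformer (a common fixed design, assuming per-channel delays in samples are known) aligns the microphone channels on the target speaker and averages them, which reinforces the target signal while averaging out uncorrelated noise:

```python
import numpy as np

def delay_and_sum(x, delays):
    """Illustrative fixed delay-and-sum beamformer (not the paper's
    exact front-end). x: (channels, samples) multichannel signal;
    delays: per-channel delays in samples toward the target speaker.
    Each channel is advanced by its delay so all channels align,
    then the channels are averaged into a single enhanced signal."""
    num_channels, _ = x.shape
    aligned = [np.roll(x[c], -int(round(delays[c])))
               for c in range(num_channels)]
    return np.mean(aligned, axis=0)

# Hypothetical two-channel example: channel 1 hears the source
# 3 samples later than channel 0.
t = np.arange(1000)
source = np.sin(2 * np.pi * 440 * t / 16000)
mics = np.stack([source, np.roll(source, 3)])
enhanced = delay_and_sum(mics, [0, 3])
```

In the paper, such a fixed front-end feeds its single-channel output to the SA-ASR model, which is then fine-tuned on it; the fully neural variant instead learns the spatial filtering weights and is optimized jointly with the SA-ASR objective.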
Problem

Research questions and friction points this paper is trying to address.

Enhances distant-microphone meeting transcription accuracy
Integrates beamforming with speaker-attributed ASR
Reduces word error rate in real meetings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint beamforming and SA-ASR for meeting transcription
Neural beamformer pretrained on real meeting data
Joint optimization of beamformer and SA-ASR model
Can Cui
Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France
Sheikh Imran Ahamad
Vivoka, Metz, France
Mostafa Sadeghi
Inria
generative models · probabilistic machine learning · audio-visual speech processing
Emmanuel Vincent
Senior Research Scientist, Inria
speech & audio