Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

📅 2024-09-13
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses fine-grained speech transcription in multi-speaker, overlapping-speech scenarios, proposing the first instruction-driven, large language model (LLM)-based automatic speech recognition (ASR) framework for speaker-discriminative transcription. Methodologically, the authors design a multimodal speech representation that integrates acoustic, speaker-identity, and semantic information, extracted via a joint WavLM–Whisper encoder; an LLM is then efficiently fine-tuned with LoRA for end-to-end modeling. The framework supports zero-shot generalization, allowing the target speaker to be specified dynamically (by sex, speaking order, language, or spoken keyword) and transcribed precisely without speaker-specific adaptation. Experiments show substantial improvements over conventional ASR systems in challenging cocktail-party scenarios, with real-time responsiveness to instructions. To foster reproducibility and further research, the code, models, and test samples are publicly released.

📝 Abstract
Recent advancements in large language models (LLMs) have revolutionized various domains, bringing significant progress and new opportunities. Despite progress in speech-related tasks, LLMs have not been sufficiently explored in multi-talker scenarios. In this work, we present a pioneering effort to investigate the capability of LLMs to transcribe speech in multi-talker environments, following versatile instructions related to multi-talker automatic speech recognition (ASR), target-talker ASR, and ASR conditioned on specific talker attributes such as sex, occurrence order, language, and keyword spoken. Our approach utilizes WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. These representations are then fed into an LLM fine-tuned using LoRA, enabling speech comprehension and transcription capabilities. Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios, highlighting the potential of LLMs to handle speech-related tasks based on user instructions in such complex settings. The code, model, and samples are available at https://github.com/cuhealthybrains/MT-LLM.
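The abstract's adaptation strategy — freezing the LLM and fine-tuning it with LoRA — can be illustrated with a minimal numpy sketch of a low-rank update W x + B(A x). All sizes here are toy values and the variable names are assumptions for illustration, not taken from the paper or its released code.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2  # hidden size and LoRA rank (toy values, not from the paper)

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # zero-initialized, so the adapter starts as a no-op

def lora_forward(x, W, A, B, alpha=1.0):
    # Adapted projection: frozen path plus scaled low-rank correction.
    return W @ x + alpha * (B @ (A @ x))

x = rng.standard_normal(d)
# With B = 0 the adapted output equals the frozen model's output exactly.
assert np.allclose(lora_forward(x, W, A, B), W @ x)
```

During fine-tuning only A and B would receive gradients, which is what makes LoRA parameter-efficient relative to updating the full weight W.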
Problem

Research questions and friction points this paper is trying to address.

LLMs remain untested in multi-talker speech transcription scenarios
Transcribing speech by following versatile multi-talker ASR instructions
Handling speech tasks in complex cocktail-party settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs transcribe multi-talker speech under user instructions
WavLM and Whisper encoders extract speaker- and semantics-sensitive representations
LoRA fine-tuned LLM handles complex speech tasks
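The encoder-fusion idea in the bullets above — pairing WavLM's speaker-sensitive features with Whisper's semantic features before handing them to the LLM — can be sketched as a frame-wise concatenation followed by a linear projection into the LLM embedding space. The dimensions and the projection itself are illustrative assumptions, not the paper's actual connector design.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 50  # number of speech frames (toy value)
d_wavlm, d_whisper, d_llm = 768, 1280, 4096  # assumed sizes for illustration

wavlm_feats = rng.standard_normal((T, d_wavlm))      # speaker-sensitive stream
whisper_feats = rng.standard_normal((T, d_whisper))  # semantic stream

# Concatenate the two streams per frame, then project into the LLM embedding space.
W_proj = rng.standard_normal((d_wavlm + d_whisper, d_llm)) * 0.01
fused = np.concatenate([wavlm_feats, whisper_feats], axis=-1) @ W_proj

assert fused.shape == (T, d_llm)
```

The fused sequence would then be prepended or interleaved with the tokenized instruction so the LLM can condition its transcription on both the audio and the user's request.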