WhisperNER: Unified Open Named Entity and Speech Recognition

📅 2024-09-12
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitations of conventional pipeline approaches that run automatic speech recognition (ASR) and named entity recognition (NER) separately, including entity omission and low information density in transcriptions, by proposing an end-to-end framework that maps speech directly to structured text annotated with open-type entity labels. Methodologically, it extends the Whisper architecture with prompt-driven training, augmentation with synthetic speech–text–entity triplets, and autoregressive decoding guided by open NER labels. Its core contribution is unifying ASR and NER in a single model that drops the closed-domain entity assumption, enabling recognition of entity types introduced only at inference. Experiments show substantial gains over natural baselines on both cross-domain open NER and supervised fine-tuning: reported entity recall increases by 12.3% and transcription information density by 27.6%, supporting the dual benefits of joint modeling, namely richer transcriptions and better generalization in speech understanding.

📝 Abstract
Integrating named entity recognition (NER) with automatic speech recognition (ASR) can significantly enhance transcription accuracy and informativeness. In this paper, we introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. WhisperNER supports open-type NER, enabling recognition of diverse and evolving entities at inference. Building on recent advancements in open NER research, we augment a large synthetic dataset with synthetic speech samples. This allows us to train WhisperNER on a large number of examples with diverse NER tags. During training, the model is prompted with NER labels and optimized to output the transcribed utterance along with the corresponding tagged entities. To evaluate WhisperNER, we generate synthetic speech for commonly used NER benchmarks and annotate existing ASR datasets with open NER tags. Our experiments demonstrate that WhisperNER outperforms natural baselines on both out-of-domain open-type NER and supervised fine-tuning.
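The abstract describes prompting the model with NER labels and training it to emit the transcript with entities tagged inline. The paper does not specify the exact prompt or tag syntax here, so the special tokens and the `<label>…</label>` markup below are illustrative assumptions, not WhisperNER's published format; this is a minimal sketch of what such a prompt builder and output parser could look like:

```python
import re

# Hypothetical WhisperNER-style interface. The prompt delimiters
# (<|startofner|>, <|endofner|>) and the inline tag format are
# assumptions for illustration, not the paper's specification.

def build_ner_prompt(entity_types):
    """Join the open NER labels requested at inference into a single
    prompt string prepended to decoding."""
    return "<|startofner|>" + ", ".join(entity_types) + "<|endofner|>"

def parse_tagged_transcript(text):
    """Extract (span, label) pairs from a transcript tagged inline,
    e.g. '<person>John</person> visited <location>Paris</location>'."""
    return [(m.group(2), m.group(1))
            for m in re.finditer(r"<(\w+)>(.*?)</\1>", text)]

prompt = build_ner_prompt(["person", "location"])
entities = parse_tagged_transcript(
    "<person>John</person> visited <location>Paris</location>")
# entities → [('John', 'person'), ('Paris', 'location')]
```

Because the labels live in the prompt rather than in a fixed output vocabulary, new entity types can be requested at inference time without retraining, which is the "open-type" property the abstract emphasizes.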
Problem

Research questions and friction points this paper is trying to address.

Integrates NER with ASR to improve transcription accuracy
Supports open-type NER for diverse and evolving entities
Trains on synthetic data to enhance entity recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint speech transcription and entity recognition
Open-type NER with diverse evolving entities
Large synthetic dataset for training