🤖 AI Summary
To address the scarcity of high-quality, speaker-annotated automatic speech recognition (ASR) datasets in finance, this work introduces SPGISpeech 2.0, a large-scale, multi-speaker ASR dataset of earnings call audio that extends the original SPGISpeech. It comprises 3,780 additional hours of professionally transcribed audio, with each snippet annotated with its call and speaker information. SPGISpeech 2.0 enables end-to-end speaker-tagged ASR modeling, substantially broadening the range of financial speech processing tasks the corpus supports. The authors validate its utility by fine-tuning popular speech recognition models on SPGISpeech 2.0, improving speaker-tagged ASR performance over off-the-shelf baselines. Released free for non-commercial use, it provides an open academic foundation for speaker-aware ASR in the financial domain.
📝 Abstract
We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. SPGISpeech 2.0 broadens the range of applicable modeling tasks while maintaining the core characteristic of the original SPGISpeech dataset: audio snippets and their corresponding fully formatted text transcriptions, usable for end-to-end automatic speech recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of professionally transcribed earnings calls. Furthermore, the dataset contains call and speaker information for each audio snippet, facilitating multi-talker ASR. We validate the utility of SPGISpeech 2.0 through improvements in the speaker-tagged ASR performance of popular speech recognition models after fine-tuning on SPGISpeech 2.0. Released free for non-commercial use, we expect SPGISpeech 2.0 to foster advancements in speech recognition technologies and inspire a wide range of research applications.
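To make the data layout concrete, here is a minimal sketch of how an SPGISpeech 2.0-style example might be represented and grouped for multi-talker experiments. The field and function names (`Snippet`, `group_by_speaker`) are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    """One hypothetical SPGISpeech 2.0-style example (field names are assumed)."""
    audio_path: str   # path to the audio snippet
    transcript: str   # fully formatted text transcription
    call_id: str      # which earnings call the snippet came from
    speaker_id: str   # which speaker within that call


def group_by_speaker(snippets):
    """Group snippets by (call_id, speaker_id), e.g. to prepare
    speaker-tagged training targets for multi-talker ASR."""
    groups = {}
    for s in snippets:
        groups.setdefault((s.call_id, s.speaker_id), []).append(s)
    return groups


if __name__ == "__main__":
    data = [
        Snippet("a.wav", "Good morning, everyone.", "call1", "ceo"),
        Snippet("b.wav", "Thanks for taking my question.", "call1", "analyst"),
        Snippet("c.wav", "Revenue grew this quarter.", "call1", "ceo"),
    ]
    groups = group_by_speaker(data)
    print(len(groups[("call1", "ceo")]))  # → 2
```

Because each snippet carries both call and speaker identifiers, the same grouping can serve either per-speaker fine-tuning or reconstructing the dialogue structure of a full call.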