SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription

📅 2025-08-07
🤖 AI Summary
To address the scarcity of high-quality, speaker-annotated automatic speech recognition (ASR) datasets in finance, this work introduces SPGISpeech 2.0, a large-scale multi-speaker ASR dataset of professionally transcribed earnings calls. It comprises 3,780 hours of audio, with call and speaker information attached to each snippet alongside fully formatted text transcriptions. This enables end-to-end speaker-tagged ASR modeling, substantially broadening the scope of financial speech processing tasks. Fine-tuning popular speech recognition models on SPGISpeech 2.0 improves their speaker-tagged ASR performance. Released free for non-commercial use, the dataset provides an academic foundation for speaker-aware ASR in the financial domain.

📝 Abstract
We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. SPGISpeech 2.0 improves the diversity of applicable modeling tasks while maintaining the core characteristic of the original SPGISpeech dataset: audio snippets and their corresponding fully formatted text transcriptions, usable for end-to-end automatic speech recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of professionally transcribed earnings calls. Furthermore, the dataset contains call and speaker information for each audio snippet, facilitating multi-talker ASR. We validate the utility of SPGISpeech 2.0 through improvements in the speaker-tagged ASR performance of popular speech recognition models after fine-tuning on SPGISpeech 2.0. Released free for non-commercial use, we expect SPGISpeech 2.0 to foster advancements in speech recognition technologies and inspire a wide range of research applications.
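The abstract describes each example as an audio snippet paired with a formatted transcription plus call and speaker metadata. A minimal sketch of how such records could be represented and assembled into a speaker-tagged transcript is shown below; the field names and layout are illustrative assumptions, not the actual SPGISpeech 2.0 schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for illustration only; the real SPGISpeech 2.0
# field names and file format may differ.
@dataclass
class Snippet:
    call_id: str      # identifies the earnings call
    speaker_id: str   # identifies the talker within the call
    audio_path: str   # path to the audio snippet
    transcript: str   # fully formatted text transcription

def speaker_tagged_transcript(snippets):
    """Concatenate one call's snippets into a speaker-tagged transcript."""
    return "\n".join(f"[{s.speaker_id}] {s.transcript}" for s in snippets)

snippets = [
    Snippet("call_001", "spk_0", "a.wav", "Good morning, everyone."),
    Snippet("call_001", "spk_1", "b.wav", "Thanks. Revenue grew this quarter."),
]
print(speaker_tagged_transcript(snippets))
```

Pairing the speaker label with each snippet's text in this way is what distinguishes speaker-tagged (multi-talker) ASR targets from the plain transcriptions of the original SPGISpeech.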
Problem

Research questions and friction points this paper is trying to address.

Enhancing speaker-tagged transcription in financial audio
Improving multi-talker ASR with call and speaker metadata
Expanding dataset diversity for end-to-end ASR tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-speaker financial audio transcription
Speaker-tagged ASR performance improvement
Diverse modeling tasks with formatted transcriptions
Authors

Raymond Grossman (Kensho Technologies, USA)
Taejin Park (NVIDIA)
Kunal Dhawan (Research Scientist, NVIDIA)
Andrew Titus (Kensho Technologies, USA)
Sophia Zhi (Kensho Technologies, USA)
Yulia Shchadilova (Kensho Technologies, USA)
Weiqing Wang (NVIDIA Corporation, USA)
Jagadeesh Balam (NVIDIA Corporation, USA)
Boris Ginsburg (NVIDIA)