SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription

📅 2025-08-07
🤖 AI Summary
To address the scarcity of high-quality, speaker-annotated automatic speech recognition (ASR) datasets in finance, this work introduces SPGISpeech 2.0, a large-scale multi-speaker ASR dataset of professionally transcribed earnings calls. It comprises 3,780 hours of audio, with call and speaker information attached to each snippet alongside fully formatted text transcriptions. This enables end-to-end speaker-tagged ASR modeling, substantially broadening the scope of financial speech processing tasks. Fine-tuning popular speech recognition models on SPGISpeech 2.0 improves their speaker-tagged ASR performance. Released free for non-commercial use, the dataset provides an academic foundation for speaker-aware ASR in the financial domain.

📝 Abstract
We introduce SPGISpeech 2.0, a dataset suitable for speaker-tagged transcription in the financial domain. SPGISpeech 2.0 improves the diversity of applicable modeling tasks while maintaining the core characteristic of the original SPGISpeech dataset: audio snippets and their corresponding fully formatted text transcriptions, usable for end-to-end automatic speech recognition (ASR). SPGISpeech 2.0 consists of 3,780 additional hours of professionally transcribed earnings calls. Furthermore, the dataset contains call and speaker information for each audio snippet, facilitating multi-talker ASR. We validate the utility of SPGISpeech 2.0 through improvements in the speaker-tagged ASR performance of popular speech recognition models after fine-tuning on SPGISpeech 2.0. Released free for non-commercial use, we expect SPGISpeech 2.0 to foster advancements in speech recognition technologies and inspire a wide range of research applications.
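The abstract describes each example as an audio snippet paired with a formatted transcription plus call and speaker metadata. A minimal sketch of how such records could be represented and assembled into a speaker-tagged transcript is shown below; the field names and layout are illustrative assumptions, not the actual SPGISpeech 2.0 schema.

```python
from dataclasses import dataclass

# Hypothetical record layout for illustration only; the real SPGISpeech 2.0
# field names and file format may differ.
@dataclass
class Snippet:
    call_id: str      # identifies the earnings call
    speaker_id: str   # identifies the talker within the call
    audio_path: str   # path to the audio snippet
    transcript: str   # fully formatted text transcription

def speaker_tagged_transcript(snippets):
    """Concatenate one call's snippets into a speaker-tagged transcript."""
    return "\n".join(f"[{s.speaker_id}] {s.transcript}" for s in snippets)

snippets = [
    Snippet("call_001", "spk_0", "a.wav", "Good morning, everyone."),
    Snippet("call_001", "spk_1", "b.wav", "Thanks. Revenue grew this quarter."),
]
print(speaker_tagged_transcript(snippets))
```

Pairing the speaker label with each snippet's text in this way is what distinguishes speaker-tagged (multi-talker) ASR targets from the plain transcriptions of the original SPGISpeech.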
Problem

Research questions and friction points this paper is trying to address.

Enhancing speaker-tagged transcription in financial audio
Improving multi-talker ASR with call and speaker metadata
Expanding dataset diversity for end-to-end ASR tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-speaker financial audio transcription
Speaker-tagged ASR performance improvement
Diverse modeling tasks with formatted transcriptions
Authors

Raymond Grossman (Kensho Technologies, USA)
Taejin Park (NVIDIA)
Kunal Dhawan (Research Scientist, NVIDIA)
Andrew Titus (Kensho Technologies, USA)
Sophia Zhi (Kensho Technologies, USA)
Yulia Shchadilova (Kensho Technologies, USA)
Weiqing Wang (NVIDIA Corporation, USA)
Jagadeesh Balam (NVIDIA Corporation, USA)
Boris Ginsburg (NVIDIA)