Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

📅 2024-09-10
🏛️ arXiv.org
📈 Citations: 5
✨ Influential: 1
📄 PDF
🤖 AI Summary
This paper addresses speaker label permutation ambiguity in multi-speaker automatic speech recognition (MS-ASR) by proposing Sortformer, a unified end-to-end framework that jointly models speaker diarization (SD) and ASR. Methodologically, it introduces (1) a novel Sort Loss that replaces the conventional permutation-invariant loss (PIL) by resolving label assignment ambiguity directly; (2) a sinusoidal-kernel-based speaker label estimation module, the first of its kind, embedded within the ASR encoder to enable fine-grained temporal alignment between speaker timestamps and ASR tokens; and (3) adapter-based fine-tuning for joint end-to-end SD-ASR training. On standard benchmarks, Sortformer achieves diarization performance competitive with the state of the art while significantly improving MS-ASR accuracy. The code and models will be publicly released within the NVIDIA NeMo framework.

๐Ÿ“ Abstract
We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challenge. Most prior end-to-end diarization systems employ permutation invariant loss (PIL), which optimizes for the permutation that yields the lowest error. In contrast, we introduce Sort Loss, which enables a diarization model to autonomously resolve permutation, with or without PIL. We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained exclusively with PIL. Crucially, we present a streamlined multispeaker ASR architecture that leverages Sortformer as a speaker supervision model, embedding speaker label estimation within the ASR encoder state using a sinusoidal kernel function. This approach resolves the speaker permutation problem through sorted objectives, effectively bridging speaker-label timestamps and speaker tokens. In our experiments, we show that the proposed multispeaker ASR architecture, enhanced with speaker supervision, improves performance via adapter techniques. Code and trained models will be made publicly available via the NVIDIA NeMo framework.
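The abstract's central contrast, PIL versus Sort Loss, can be sketched in a few lines. This is an illustrative reconstruction from the description above, not the paper's implementation: PIL searches all speaker orderings for the lowest error, while a sort-based loss fixes a canonical order by sorting target speakers by arrival time (first frame of activity), so no permutation search is needed. The function names and the frame-level binary-activity representation are assumptions.

```python
import itertools
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy averaged over all frames and speaker slots."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def pil_loss(pred, target):
    """Permutation-invariant loss: minimum BCE over all speaker orderings.

    pred, target: (T, S) frame-by-speaker activity matrices.
    Cost grows factorially in the number of speakers S.
    """
    n_spk = target.shape[1]
    return min(bce(pred, target[:, list(p)]) for p in itertools.permutations(range(n_spk)))

def sort_targets_by_arrival(target):
    """Reorder speaker columns by the frame index of each speaker's first activity."""
    onsets = [int(np.argmax(target[:, s] > 0)) if target[:, s].any() else target.shape[0]
              for s in range(target.shape[1])]
    return target[:, np.argsort(onsets, kind="stable")]

def sort_loss(pred, target):
    """Sort Loss (sketch): BCE against arrival-time-sorted targets, no permutation search."""
    return bce(pred, sort_targets_by_arrival(target))
```

Because the targets are deterministically sorted, a single BCE evaluation replaces the factorial search over permutations, and the model learns to emit speakers in arrival order on its own.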
Problem

Research questions and friction points this paper is trying to address.

Resolves speaker permutation in speech-to-text systems
Improves multi-speaker transcription accuracy
Integrates speaker tagging into foundational speech models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sortformer uses Sort Loss for permutation resolution
Embeds speaker labels via sinusoidal kernel functions
Streamlines multi-speaker speech-to-text architecture
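The sinusoidal-kernel speaker embedding mentioned above might look roughly like the following sketch, assuming each speaker slot gets a sinusoid of a distinct frequency over the feature dimension, gated per frame by that speaker's predicted activity and added to the ASR encoder states. The kernel form, frequencies, and injection point are assumptions; the paper's actual design may differ.

```python
import numpy as np

def speaker_kernel_embedding(encoder_states, speaker_probs, base_freq=0.01):
    """Inject speaker information into ASR encoder states (illustrative sketch).

    encoder_states: (T, D) acoustic encoder output.
    speaker_probs:  (T, S) per-frame speaker activity probabilities,
                    e.g. from a Sortformer-style supervision model.
    """
    T, D = encoder_states.shape
    S = speaker_probs.shape[1]
    d = np.arange(D)
    # (S, D) bank of sinusoidal kernels, one frequency per speaker slot
    kernels = np.stack([np.sin(2 * np.pi * base_freq * (s + 1) * d) for s in range(S)])
    # gate each kernel by per-frame speaker activity, sum over speakers
    injection = speaker_probs @ kernels  # (T, S) @ (S, D) -> (T, D)
    return encoder_states + injection
```

Because the sort-based objective fixes which speaker occupies which slot, each kernel frequency consistently marks "first speaker", "second speaker", and so on, which is what lets timestamps and speaker tokens line up without a permutation search.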