AI Summary
This paper addresses speaker label permutation ambiguity in multi-speaker automatic speech recognition (MS-ASR) by proposing Sortformer, a unified end-to-end framework that jointly models speaker diarization (SD) and ASR. Methodologically, it introduces (1) a novel Sort Loss that replaces the conventional permutation-invariant loss (PIL) and resolves label assignment ambiguity directly; (2) a sinusoidal-kernel-based speaker label estimation module, the first of its kind, embedded within the ASR encoder to enable fine-grained temporal alignment between speaker timestamps and ASR tokens; and (3) adapter-based fine-tuning for joint end-to-end SD-ASR training. Evaluated on standard benchmarks, Sortformer achieves state-of-the-art speaker diarization performance while significantly improving MS-ASR accuracy. The code and models will be publicly released within the NVIDIA NeMo framework.
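The core idea behind Sort Loss can be sketched as follows: instead of searching over all speaker-label permutations (as PIL does), the target speaker-activity rows are sorted deterministically by each speaker's arrival time (first active frame), and an ordinary binary cross-entropy is computed against that sorted target. This is an illustrative NumPy sketch under that reading of the summary, not the paper's implementation; the function names and the exact loss formulation here are assumptions.

```python
import numpy as np

def sort_targets_by_arrival(targets):
    """Reorder speaker rows so speakers appear in order of first activity.

    targets: (num_speakers, num_frames) binary activity matrix.
    Silent speakers (all-zero rows) are pushed to the end.
    """
    num_frames = targets.shape[1]
    first_active = [
        int(np.argmax(row)) if row.any() else num_frames for row in targets
    ]
    order = np.argsort(first_active, kind="stable")
    return targets[order]

def sort_loss(logits, targets):
    """Binary cross-entropy against arrival-time-sorted targets.

    Unlike PIL, no search over speaker permutations is needed:
    sorting makes the target ordering deterministic.
    """
    sorted_t = sort_targets_by_arrival(np.asarray(targets, dtype=float))
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    eps = 1e-9  # numerical guard against log(0)
    bce = -(sorted_t * np.log(probs + eps)
            + (1.0 - sorted_t) * np.log(1.0 - probs + eps))
    return bce.mean()
```

With this objective the model itself must learn to emit speakers in arrival order, which is what lets downstream components consume its outputs without permutation resolution.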
Abstract
We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challenge. Most prior end-to-end diarization systems employ permutation invariant loss (PIL), which optimizes for the permutation that yields the lowest error. In contrast, we introduce Sort Loss, which enables a diarization model to autonomously resolve permutation, with or without PIL. We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained exclusively with PIL. Crucially, we present a streamlined multispeaker ASR architecture that leverages Sortformer as a speaker supervision model, embedding speaker label estimation within the ASR encoder state using a sinusoidal kernel function. This approach resolves the speaker permutation problem through sorted objectives, effectively bridging speaker-label timestamps and speaker tokens. In our experiments, we show that the proposed multispeaker ASR architecture, enhanced with speaker supervision, improves performance via adapter techniques. Code and trained models will be made publicly available via the NVIDIA NeMo framework.
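The speaker-supervision step described above (embedding speaker label estimates into the ASR encoder state via a sinusoidal kernel) might look roughly like the following. This is a speculative sketch: the specific kernel form, shapes, and function names are assumptions for illustration, and the paper's actual formulation may differ.

```python
import numpy as np

def speaker_kernel_embedding(spk_probs, d_model):
    """Map per-frame speaker activity probabilities into the encoder
    dimension with a sinusoidal kernel.

    spk_probs: (num_frames, num_speakers) probabilities, already in
    arrival order thanks to the sorted objective.
    Returns a (num_frames, d_model) additive embedding.
    The per-speaker sinusoid below is a hypothetical choice of kernel.
    """
    _, num_spk = spk_probs.shape
    positions = np.arange(d_model)
    kernels = np.stack([
        np.sin((s + 1) * np.pi * positions / d_model) for s in range(num_spk)
    ])  # (num_speakers, d_model), one basis per sorted speaker slot
    return spk_probs @ kernels

def inject_speaker_supervision(encoder_states, spk_probs):
    """Add the speaker embedding to the ASR encoder states frame-wise."""
    return encoder_states + speaker_kernel_embedding(
        spk_probs, encoder_states.shape[1]
    )
```

Because the speaker slots are sorted by arrival time, each kernel row consistently corresponds to "first speaker", "second speaker", and so on, which is what bridges the diarization timestamps to speaker tokens without a permutation search.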