AI Summary
This paper addresses speaker label permutation ambiguity in multi-speaker automatic speech recognition (MS-ASR) by proposing Sortformer, a unified end-to-end framework that jointly models speaker diarization (SD) and ASR. Methodologically, it introduces (1) a novel Sort Loss that replaces the conventional permutation-invariant loss (PIL) and resolves label assignment ambiguity directly; (2) a sinusoidal-kernel-based speaker label estimation module, the first of its kind, embedded within the ASR encoder to enable fine-grained temporal alignment between speaker timestamps and ASR tokens; and (3) adapter-based fine-tuning for joint end-to-end SD-ASR training. Evaluated on standard benchmarks, Sortformer achieves state-of-the-art speaker diarization performance while significantly improving MS-ASR accuracy. The code and models will be publicly released within the NVIDIA NeMo framework.
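The core idea behind Sort Loss can be sketched as follows: instead of searching over all speaker-label permutations (as PIL does), the target speaker-activity rows are sorted deterministically by each speaker's arrival time (first active frame), and an ordinary binary cross-entropy is computed against that sorted target. This is an illustrative NumPy sketch under that reading of the summary, not the paper's implementation; the function names and the exact loss formulation here are assumptions.

```python
import numpy as np

def sort_targets_by_arrival(targets):
    """Reorder speaker rows so speakers appear in order of first activity.

    targets: (num_speakers, num_frames) binary activity matrix.
    Silent speakers (all-zero rows) are pushed to the end.
    """
    num_frames = targets.shape[1]
    first_active = [
        int(np.argmax(row)) if row.any() else num_frames for row in targets
    ]
    order = np.argsort(first_active, kind="stable")
    return targets[order]

def sort_loss(logits, targets):
    """Binary cross-entropy against arrival-time-sorted targets.

    Unlike PIL, no search over speaker permutations is needed:
    sorting makes the target ordering deterministic.
    """
    sorted_t = sort_targets_by_arrival(np.asarray(targets, dtype=float))
    probs = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    eps = 1e-9  # numerical guard against log(0)
    bce = -(sorted_t * np.log(probs + eps)
            + (1.0 - sorted_t) * np.log(1.0 - probs + eps))
    return bce.mean()
```

With this objective the model itself must learn to emit speakers in arrival order, which is what lets downstream components consume its outputs without permutation resolution.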
Abstract
We propose Sortformer, a novel neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models. The permutation problem in speaker diarization has long been regarded as a critical challenge. Most prior end-to-end diarization systems employ permutation invariant loss (PIL), which optimizes for the permutation that yields the lowest error. In contrast, we introduce Sort Loss, which enables a diarization model to autonomously resolve permutation, with or without PIL. We demonstrate that combining Sort Loss and PIL achieves performance competitive with state-of-the-art end-to-end diarization models trained exclusively with PIL. Crucially, we present a streamlined multispeaker ASR architecture that leverages Sortformer as a speaker supervision model, embedding speaker label estimation within the ASR encoder state using a sinusoidal kernel function. This approach resolves the speaker permutation problem through sorted objectives, effectively bridging speaker-label timestamps and speaker tokens. In our experiments, we show that the proposed multispeaker ASR architecture, enhanced with speaker supervision, improves performance via adapter techniques. Code and trained models will be made publicly available via the NVIDIA NeMo framework.
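The speaker-supervision step described above (embedding speaker label estimates into the ASR encoder state via a sinusoidal kernel) might look roughly like the following. This is a speculative sketch: the specific kernel form, shapes, and function names are assumptions for illustration, and the paper's actual formulation may differ.

```python
import numpy as np

def speaker_kernel_embedding(spk_probs, d_model):
    """Map per-frame speaker activity probabilities into the encoder
    dimension with a sinusoidal kernel.

    spk_probs: (num_frames, num_speakers) probabilities, already in
    arrival order thanks to the sorted objective.
    Returns a (num_frames, d_model) additive embedding.
    The per-speaker sinusoid below is a hypothetical choice of kernel.
    """
    _, num_spk = spk_probs.shape
    positions = np.arange(d_model)
    kernels = np.stack([
        np.sin((s + 1) * np.pi * positions / d_model) for s in range(num_spk)
    ])  # (num_speakers, d_model), one basis per sorted speaker slot
    return spk_probs @ kernels

def inject_speaker_supervision(encoder_states, spk_probs):
    """Add the speaker embedding to the ASR encoder states frame-wise."""
    return encoder_states + speaker_kernel_embedding(
        spk_probs, encoder_states.shape[1]
    )
```

Because the speaker slots are sorted by arrival time, each kernel row consistently corresponds to "first speaker", "second speaker", and so on, which is what bridges the diarization timestamps to speaker tokens without a permutation search.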