Modeling Overlapped Speech with Shuffles

📅 2026-03-18
🤖 AI Summary
This work proposes an end-to-end trainable approach to aligning and transcribing overlapped speech from multiple speakers in a single pass. It uses the shuffle product and partial order finite-state automata (FSAs) to model (token, speaker) tuples directly, and trains by marginalizing over all possible serializations of the overlapping sequences at subword, word, and phrase levels. Viterbi alignment through the shuffle product FSA yields alignment and speaker-attributed transcription in one pass. The method is implemented in k2 / Icefall and evaluated on synthetic LibriSpeech overlaps; the authors state that, to their knowledge, it is the first algorithm to enable single-pass alignment of multi-talker recordings.

📝 Abstract
We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.
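To make the idea of "all possible serializations" concrete, the following is a minimal sketch in plain Python, not the paper's k2 / Icefall implementation. It enumerates the shuffle product of two short sequences of (token, speaker) tuples and checks the count against a dynamic program over the grid of shuffle states. The function names `shuffle` and `count_paths` and the two example utterances are assumptions made for illustration only.

```python
from math import comb
from typing import Iterator, List, Sequence, Tuple

Label = Tuple[str, int]  # (token, speaker) tuple, as in the abstract


def shuffle(a: Sequence[Label], b: Sequence[Label]) -> Iterator[List[Label]]:
    """Yield every interleaving of a and b that keeps each sequence's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    # Either the next label comes from a ...
    for rest in shuffle(a[1:], b):
        yield [a[0]] + rest
    # ... or it comes from b.
    for rest in shuffle(a, b[1:]):
        yield [b[0]] + rest


def count_paths(m: int, n: int) -> int:
    """Count paths through the (m+1) x (n+1) grid of shuffle states."""
    paths = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # A state is reached by consuming one label from either sequence.
            paths[i][j] = paths[i - 1][j] + paths[i][j - 1]
    return paths[m][n]


# Hypothetical two-speaker example (not taken from the paper's data).
spk1 = [("hello", 1), ("there", 1)]
spk2 = [("good", 2), ("morning", 2)]

serializations = list(shuffle(spk1, spk2))
for path in serializations:
    print(path)

# Number of serializations is the binomial coefficient C(m + n, m).
assert len(serializations) == comb(4, 2) == count_paths(2, 2)
```

The grid has only (m+1)(n+1) states while the number of paths through it grows combinatorially, which is why summing over paths in a graph is preferable to enumerating serializations. This toy version ignores acoustic frames and scores entirely; it only illustrates the label-level structure that the shuffle product captures.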

Problem

Research questions and friction points this paper is trying to address.

overlapped speech
speaker-attributed transcription
alignment
multi-talker recordings
shuffle product

Innovation

Methods, ideas, or system contributions that make the work stand out.

shuffle product
partial order FSA
overlapped speech
speaker-attributed transcription
one-pass alignment