Modeling Overlapped Speech with Shuffles

📅 2026-03-18

🤖 AI Summary

This work proposes an end-to-end trainable approach to aligning and transcribing overlapped speech from multiple speakers in a single pass. The method uses the shuffle product and partial order finite-state automata (FSAs) to model (token, speaker) tuples directly, and trains by marginalizing over all possible serializations of the overlapping sequences at subword, word, and phrase levels. Partial order FSAs impose temporal constraints that reduce graph size, and Viterbi alignment through the shuffle product FSA yields the alignment in one pass. The system is implemented in k2 / Icefall and evaluated on synthetic LibriSpeech overlaps. According to the authors, it is the first algorithm to enable single-pass alignment of multi-talker recordings.
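To make the term "serialized paths" concrete, here is a minimal sketch of the shuffle product of two speaker streams in plain Python. It is an illustration only, not the paper's k2 / Icefall implementation: the function name `shuffle_product` and the example utterances are assumptions introduced here. Each element is a (token, speaker) tuple, and every output is one interleaving that keeps each speaker's own token order.

```python
from itertools import combinations
from math import comb


def shuffle_product(a, b):
    """Yield every interleaving of a and b that preserves the order within each."""
    n, m = len(a), len(b)
    # Choose which of the n + m output slots are filled from `a`;
    # the remaining slots are filled from `b`, both in their original order.
    for positions in combinations(range(n + m), n):
        chosen = set(positions)
        ai, bi = iter(a), iter(b)
        yield tuple(next(ai) if k in chosen else next(bi) for k in range(n + m))


# Hypothetical two-speaker example; tokens are tagged with a speaker index.
spk1 = [("hello", 1), ("there", 1)]
spk2 = [("good", 2), ("morning", 2)]

serializations = list(shuffle_product(spk1, spk2))
for s in serializations:
    print(" ".join(f"{tok}/{spk}" for tok, spk in s))

# The number of serializations is the binomial coefficient C(n + m, n).
print(len(serializations), comb(len(spk1) + len(spk2), len(spk1)))
```

The count grows combinatorially with sequence length, which is presumably why the approach represents the serializations compactly as an FSA and restricts them further with temporal constraints instead of enumerating them.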

📝 Abstract

We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.
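The difference between the total score (used as the training loss) and the Viterbi score (used for alignment) can be sketched on a toy shuffle lattice. The code below is a simplified illustration under stated assumptions, not the authors' k2 implementation: it assumes exactly one emission step per token, a hypothetical table `logp[k][token]` of per-step log-probabilities, and hypothetical names `total_and_best` and `brute_force_total`. State (i, j) means that i tokens from the first speaker and j tokens from the second have been emitted.

```python
import math
from itertools import combinations


def logaddexp(x, y):
    """Numerically stable log(exp(x) + exp(y))."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    hi, lo = max(x, y), min(x, y)
    return hi + math.log1p(math.exp(lo - hi))


def total_and_best(a, b, logp):
    """Return (log-sum over all interleavings, max over all interleavings)."""
    n, m = len(a), len(b)
    neg = -math.inf
    total = [[neg] * (m + 1) for _ in range(n + 1)]
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    total[0][0] = best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:  # last emitted token came from stream a, at step i + j - 1
                s = logp[i + j - 1][a[i - 1]]
                total[i][j] = logaddexp(total[i][j], total[i - 1][j] + s)
                best[i][j] = max(best[i][j], best[i - 1][j] + s)
            if j > 0:  # last emitted token came from stream b, at step i + j - 1
                s = logp[i + j - 1][b[j - 1]]
                total[i][j] = logaddexp(total[i][j], total[i][j - 1] + s)
                best[i][j] = max(best[i][j], best[i][j - 1] + s)
    return total[n][m], best[n][m]


def brute_force_total(a, b, logp):
    """Reference value: enumerate every interleaving explicitly and sum."""
    n, m = len(a), len(b)
    probs = []
    for pos in combinations(range(n + m), n):
        ai, bi = iter(a), iter(b)
        seq = [next(ai) if k in pos else next(bi) for k in range(n + m)]
        probs.append(math.exp(sum(logp[k][tok] for k, tok in enumerate(seq))))
    return math.log(sum(probs))


# Hypothetical (token, speaker) tuples and made-up per-step probabilities.
A, B, C = ("a", 1), ("b", 1), ("c", 2)
probs = [
    {A: 0.7, B: 0.1, C: 0.2},
    {A: 0.2, B: 0.3, C: 0.5},
    {A: 0.1, B: 0.6, C: 0.3},
]
logp = [{tok: math.log(p) for tok, p in step.items()} for step in probs]

total, best = total_and_best([A, B], [C], logp)
print(math.exp(total), math.exp(best))
```

The dynamic program visits (n + 1)(m + 1) states instead of enumerating every interleaving, which is the same saving that computing the total score on an FSA provides. In the paper's setting the lattice would additionally be composed with frame-level acoustic scores, which this sketch omits.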

Problem

Research questions and friction points this paper is trying to address.

overlapped speech

speaker-attributed transcription

alignment

multi-talker recordings

shuffle product

Innovation

Methods, ideas, or system contributions that make the work stand out.

shuffle product

partial order FSA

overlapped speech

speaker-attributed transcription

one-pass alignment
