🤖 AI Summary
Existing deep neural network (DNN) models for multi-utterance speech separation and speaker diarization in long recordings are trained on short segments (e.g., 10 s), limiting their generalization to real-world long audio (21–121 s in the reported experiments) containing multiple silence gaps. Method: We propose the frequency-temporal recurrent neural network (FTRNN), a dual-path recurrent architecture: a full-band module captures intra-frame frequency dependencies, while a sub-band module models long-range temporal patterns in each frequency band—enabling end-to-end speaker-consistent separation across silence gaps without segment-wise stitching. Contribution/Results: With only 0.9 M parameters, FTRNN significantly outperforms prior methods on unseen signal lengths and silence durations. It achieves robust separation and speaker association under the short-training, long-inference paradigm, demonstrating strong generalization without length-specific adaptation or post-hoc alignment.
📝 Abstract
Current deep neural network (DNN) based speech separation faces a fundamental challenge -- while models must be trained on short segments due to computational constraints, real-world applications typically require processing significantly longer recordings, with multiple utterances per speaker, than those seen during training. In this paper, we investigate how existing approaches perform in this challenging scenario and propose a frequency-temporal recurrent neural network (FTRNN) that effectively bridges this gap. Our FTRNN employs a full-band module to model frequency dependencies within each time frame and a sub-band module to model temporal patterns in each frequency band. Despite being trained on short fixed-length segments of 10 s, our model demonstrates robust separation when processing signals significantly longer than the training segments (21–121 s) and preserves speaker association across utterance gaps exceeding those seen during training. Unlike the conventional segment-separate-stitch paradigm, our lightweight approach (0.9 M parameters) performs inference on long audio without segmentation, eliminating segment-boundary distortions while simplifying deployment. Experimental results demonstrate the generalization ability of FTRNN for multi-utterance speech separation and speaker association.
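As a rough illustration of the dual-path idea in the abstract (this is not the authors' implementation: the block structure, tensor shapes, and the use of a simple exponential moving average as a stand-in for learned RNNs are all assumptions), the full-band/sub-band pattern amounts to scanning the same time-frequency representation along two different axes:

```python
import numpy as np

def scan_ema(x, axis, alpha=0.5):
    """Stand-in for a learned recurrent layer: an exponential moving
    average scanned sequentially along `axis`."""
    x = np.moveaxis(x, axis, 0)
    out = np.empty_like(x)
    state = np.zeros_like(x[0])
    for t in range(x.shape[0]):
        state = alpha * x[t] + (1 - alpha) * state
        out[t] = state
    return np.moveaxis(out, 0, axis)

def dual_path_block(spec):
    """One hypothetical dual-path pass over a (time, freq) spectrogram.

    Full-band pass: scan across frequency within each time frame
    (intra-frame frequency dependencies).
    Sub-band pass: scan across time within each frequency band
    (long-range temporal patterns), so inference length is not tied
    to the training segment length.
    """
    full_band = scan_ema(spec, axis=1)  # along frequency, per frame
    sub_band = scan_ema(full_band, axis=0)  # along time, per band
    return sub_band

spec = np.random.rand(50, 129)  # 50 frames, 129 frequency bins
out = dual_path_block(spec)
assert out.shape == spec.shape  # the T x F layout is preserved
```

Because the temporal scan is a recurrence rather than a fixed-length operator, the same block can in principle be unrolled over inputs much longer than those seen in training, which is the property the paper exploits.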