TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models

📅 2025-04-06
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmark evaluations for self-supervised speech models focus predominantly on single-speaker scenarios, failing to reflect their real-world capability in noisy, multi-speaker environments for target-speaker identification and information extraction. Method: We introduce TS-SUPERB—the first target-speaker-oriented benchmark for multi-speaker noisy conditions—encompassing four tasks: enrollment-guided speech separation, recognition, verification, and synthesis. We propose a unified SSL-based target-speech encoder architecture that jointly optimizes speaker encoding and speech extraction modules, and develop an end-to-end framework integrating self-supervised representations, speaker-embedding-conditioned decoding, and multi-task joint training. Contribution/Results: Experiments reveal that single-speaker performance is not predictive of target-speaker task performance. Our approach achieves significant improvements over prior state-of-the-art across multiple TS tasks, demonstrating the efficacy of cross-task information sharing and joint modeling.

📝 Abstract
Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions -- a more challenging yet practical case. In this paper, we introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which includes four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from the speech mixture. In our benchmark, the speaker embedding extracted from enrollment speech is used as a clue to condition downstream models. The benchmark result reveals the importance of evaluating SSL models in target speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. Moreover, by using a unified SSL-based target speech encoder, consisting of a speaker encoder and an extractor module, we also investigate joint optimization across TS tasks to leverage mutual information and demonstrate its effectiveness.
Problem

Research questions and friction points this paper is trying to address.

Evaluating SSL models in noisy multi-talker target-speaker scenarios
Benchmarking target-speaker processing tasks with enrollment embeddings
Investigating joint optimization across target-speaker tasks using SSL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces TS-SUPERB benchmark for target-speaker tasks
Uses speaker embedding to condition downstream models
Proposes unified SSL-based target speech encoder
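The unified target speech encoder described above pairs a speaker encoder (which turns an enrollment utterance into a fixed speaker embedding) with an extractor that conditions SSL features of the mixture on that embedding. A minimal sketch of this conditioning pattern, using NumPy and purely illustrative function names and dimensions (mean-pooling and multiplicative fusion are common simplifications, not the paper's exact modules):

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_encoder(enroll_feats):
    # Hypothetical speaker encoder: mean-pool enrollment SSL frames
    # into a single fixed-dimensional speaker embedding.
    return enroll_feats.mean(axis=0)

def extractor(mixture_feats, spk_emb):
    # Hypothetical extractor: condition every mixture frame on the
    # speaker embedding via element-wise (multiplicative) fusion,
    # a simplified stand-in for learned conditioning layers.
    return mixture_feats * spk_emb[None, :]

D = 8                                    # SSL feature dimension (illustrative)
enroll = rng.standard_normal((50, D))    # enrollment utterance frames
mixture = rng.standard_normal((200, D))  # noisy multi-speaker mixture frames

emb = speaker_encoder(enroll)
target_feats = extractor(mixture, emb)   # speaker-conditioned features
print(target_feats.shape)                # (200, 8)
```

The conditioned features `target_feats` would then feed shared downstream heads (separation, recognition, verification, synthesis), which is what enables the joint optimization across TS tasks investigated in the paper.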
Junyi Peng (Brno University of Technology, Czechia)
Takanori Ashihara (NTT)
Marc Delcroix (NTT Communication Science Laboratories)
Tsubasa Ochiai (NTT Corporation, Japan)
Oldrich Plchot (Brno University of Technology, Czechia)
Shoko Araki (NTT Corporation, Japan)
J. Černocký (Brno University of Technology, Czechia)

Topics: Speech processing, Robust ASR, Speech enhancement, Target speech extraction