TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models

📅 2025-04-06
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmark evaluations for self-supervised speech models focus predominantly on single-speaker scenarios, failing to reflect their real-world capability in noisy, multi-speaker environments for target-speaker identification and information extraction. Method: We introduce TS-SUPERB—the first target-speaker-oriented benchmark for multi-speaker noisy conditions—encompassing four tasks: enrollment-guided speech separation, recognition, verification, and synthesis. We propose a unified SSL-based target-speech encoder architecture that jointly optimizes speaker encoding and speech extraction modules, and develop an end-to-end framework integrating self-supervised representations, speaker-embedding-conditioned decoding, and multi-task joint training. Contribution/Results: Experiments reveal that single-speaker performance is not predictive of target-speaker task performance. Our approach achieves significant improvements over prior state-of-the-art across multiple TS tasks, demonstrating the efficacy of cross-task information sharing and joint modeling.

📝 Abstract
Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions -- a more challenging yet practical case. In this paper, we introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which includes four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from the speech mixture. In our benchmark, the speaker embedding extracted from enrollment speech is used as a clue to condition downstream models. The benchmark result reveals the importance of evaluating SSL models in target speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. Moreover, by using a unified SSL-based target speech encoder, consisting of a speaker encoder and an extractor module, we also investigate joint optimization across TS tasks to leverage mutual information and demonstrate its effectiveness.
Problem

Research questions and friction points this paper is trying to address.

Evaluating SSL models in noisy multi-talker target-speaker scenarios
Benchmarking target-speaker processing tasks with enrollment embeddings
Investigating joint optimization across target-speaker tasks using SSL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces TS-SUPERB benchmark for target-speaker tasks
Uses speaker embedding to condition downstream models
Proposes unified SSL-based target speech encoder
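The unified target speech encoder described above pairs a speaker encoder (which turns an enrollment utterance into a fixed speaker embedding) with an extractor that conditions SSL features of the mixture on that embedding. A minimal sketch of this conditioning pattern, using NumPy and purely illustrative function names and dimensions (mean-pooling and multiplicative fusion are common simplifications, not the paper's exact modules):

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_encoder(enroll_feats):
    # Hypothetical speaker encoder: mean-pool enrollment SSL frames
    # into a single fixed-dimensional speaker embedding.
    return enroll_feats.mean(axis=0)

def extractor(mixture_feats, spk_emb):
    # Hypothetical extractor: condition every mixture frame on the
    # speaker embedding via element-wise (multiplicative) fusion,
    # a simplified stand-in for learned conditioning layers.
    return mixture_feats * spk_emb[None, :]

D = 8                                    # SSL feature dimension (illustrative)
enroll = rng.standard_normal((50, D))    # enrollment utterance frames
mixture = rng.standard_normal((200, D))  # noisy multi-speaker mixture frames

emb = speaker_encoder(enroll)
target_feats = extractor(mixture, emb)   # speaker-conditioned features
print(target_feats.shape)                # (200, 8)
```

The conditioned features `target_feats` would then feed shared downstream heads (separation, recognition, verification, synthesis), which is what enables the joint optimization across TS tasks investigated in the paper.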
Junyi Peng (Brno University of Technology, Czechia)
Takanori Ashihara (NTT)
Marc Delcroix (NTT Communication Science Laboratories)
Tsubasa Ochiai (NTT Corporation, Japan)
Oldrich Plchot (Brno University of Technology, Czechia)
Shoko Araki (NTT Corporation, Japan)
J. Černocký (Brno University of Technology, Czechia)

Topics: Speech processing, Robust ASR, Speech enhancement, Target speech extraction