🤖 AI Summary
This study addresses the lack of interpretability in voice timbre attribute detection by introducing, for the first time, a timbre-dimension intensity comparison task, designed to quantitatively assess which of two speech samples is stronger along a specific timbre descriptor (e.g., "bright", "hoarse"). Leveraging the VCTK-RVA dataset, the organizers establish a unified evaluation framework integrating speech feature extraction, an interpretable deep learning model, and a comparative scoring mechanism. Six teams participated in the benchmark evaluation; five submitted detailed method descriptions, enabling systematic validation of diverse modeling strategies in terms of timbral semantic alignment and cross-sample comparability. The work establishes the first benchmark task and data protocol explicitly designed for interpretable timbre analysis, and fosters interdisciplinary advancement at the intersection of speech perception modeling and computational timbre research.
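The summary does not specify a system interface, but the comparison task can be illustrated with a minimal sketch: score each utterance's embedding along a learned descriptor axis and report which sample is stronger. The function name, the dot-product scoring rule, and the embeddings below are all illustrative assumptions, not the challenge's actual method.

```python
import numpy as np

def compare_intensity(emb_a, emb_b, descriptor_axis):
    """Hypothetical pairwise comparator for one timbre descriptor.

    emb_a, emb_b: utterance embeddings (1-D arrays).
    descriptor_axis: an assumed learned direction for a descriptor
    such as "bright" or "hoarse" (same dimensionality).
    Returns the stronger sample's label and the score margin.
    """
    # Project each utterance onto the descriptor axis to get an
    # intensity score; higher projection = stronger attribute.
    s_a = float(np.dot(emb_a, descriptor_axis))
    s_b = float(np.dot(emb_b, descriptor_axis))
    return ("A" if s_a > s_b else "B"), s_a - s_b

# Toy example with made-up 3-D embeddings.
label, margin = compare_intensity(
    np.array([0.9, 0.1, 0.0]),   # utterance A
    np.array([0.2, 0.8, 0.0]),   # utterance B
    np.array([1.0, 0.0, 0.0]),   # assumed "bright" axis
)
print(label, margin)  # → A 0.7
```

Real submissions would learn both the embeddings and the descriptor axes from annotated pairs; the point here is only the comparison protocol: two utterances in, a relative-intensity decision out.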
📝 Abstract
The first voice timbre attribute detection challenge is featured in a special session at NCMMSC 2025. It focuses on the explainability of voice timbre: the task is to compare the intensity of two speech utterances along a specified timbre descriptor dimension. The evaluation was conducted on the VCTK-RVA dataset. Participants developed their systems and submitted their outputs to the organizer, who evaluated the performance and returned feedback. Six teams submitted outputs, five of which provided descriptions of their methodologies.