Introducing voice timbre attribute detection

📅 2025-05-14
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This paper introduces the voice timbre attribute detection (vTAD) task, aiming to enable comparable and interpretable quantitative modeling of timbre using human-understandable perceptual attributes (e.g., “bright”, “hoarse”). Methodologically, it frames timbre perception differences as a contrastive attribute discrimination problem within speaker embedding space—the first such formulation—and constructs VCTK-RVA, the first dedicated vTAD benchmark dataset. It systematically evaluates two speaker encoders—ECAPA-TDNN and FACodec—revealing that ECAPA-TDNN excels on seen speakers, whereas FACodec demonstrates superior generalization to unseen speakers. Key contributions include: (1) a formal definition of the vTAD task; (2) public release of the VCTK-RVA dataset and associated open-source code; and (3) empirical characterization of fundamental differences in generalization behavior between speaker encoders, thereby establishing a foundation for interpretable timbre analysis.

📝 Abstract
This paper focuses on explaining the timbre conveyed by speech signals and introduces a task termed voice timbre attribute detection (vTAD). In this task, voice timbre is explained with a set of sensory attributes describing its human perception. A pair of speech utterances is processed, and their relative intensity in a designated timbre descriptor is compared. Moreover, a framework is proposed, built upon the speaker embeddings extracted from the speech utterances. The investigation is conducted on the VCTK-RVA dataset. Experiments with the ECAPA-TDNN and FACodec speaker encoders demonstrated that: 1) the ECAPA-TDNN speaker encoder was more capable in the seen scenario, where the testing speakers were included in the training set; 2) the FACodec speaker encoder was superior in the unseen scenario, where the testing speakers were not part of the training set, indicating enhanced generalization capability. The VCTK-RVA dataset and open-source code are available at https://github.com/vTAD2025-Challenge/vTAD.
Problem

Research questions and friction points this paper is trying to address.

Detecting voice timbre attributes in speech signals
Comparing timbre intensity between paired speech utterances
Evaluating speaker encoders for seen and unseen scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Voice timbre attribute detection using sensory attributes
Framework based on speaker embeddings extraction
ECAPA-TDNN and FACodec encoders for different scenarios
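The pairwise formulation above can be sketched as follows. This is a minimal illustration, not the authors' released implementation: it assumes speaker embeddings have already been extracted (e.g., 192-dimensional vectors, a typical ECAPA-TDNN size), and uses a randomly initialized toy classifier head (`PairwiseAttributeHead` is a hypothetical name) standing in for a trained discriminator that, given embeddings for utterances A and B, predicts which utterance exhibits the stronger intensity of a given timbre attribute.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 192  # assumed embedding size; ECAPA-TDNN commonly outputs 192-dim vectors


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class PairwiseAttributeHead:
    """Toy discriminator over a pair of speaker embeddings.

    Given embeddings for utterances A and B, it outputs the probability
    that A has the stronger intensity of one timbre attribute (e.g.,
    "bright"). Weights are random here, standing in for a trained model.
    """

    def __init__(self, emb_dim, hidden=64):
        self.w1 = rng.normal(0.0, 0.05, size=(2 * emb_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.05, size=(hidden, 1))
        self.b2 = np.zeros(1)

    def forward(self, emb_a, emb_b):
        # Concatenate the two embeddings into one pair representation,
        # then apply a small two-layer MLP with a sigmoid output.
        x = np.concatenate([emb_a, emb_b])
        h = np.tanh(x @ self.w1 + self.b1)
        return float(sigmoid(h @ self.w2 + self.b2)[0])


# Stand-ins for embeddings extracted from two real utterances.
emb_a = rng.normal(size=EMB_DIM)
emb_b = rng.normal(size=EMB_DIM)

head = PairwiseAttributeHead(EMB_DIM)
p = head.forward(emb_a, emb_b)
print(f"P(A stronger than B in the chosen attribute) = {p:.3f}")
```

In practice the head would be trained on ordered utterance pairs from VCTK-RVA, and the comparison of seen vs. unseen speakers amounts to swapping which encoder (ECAPA-TDNN or FACodec) produces `emb_a` and `emb_b`.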
Jinghao He
NERC-SLIP, University of Science and Technology of China, China
Zhengyan Sheng
University of Science and Technology of China
Speech Synthesis · Multimodality-driven Speaker Generation
Liping Chen
NERC-SLIP, University of Science and Technology of China, China
Kong Aik Lee
The Hong Kong Polytechnic University, Hong Kong
Speaker and Spoken Language Recognition · Speech Processing · Digital Signal Processing · Subband
Zhenhua Ling
NERC-SLIP, University of Science and Technology of China, China