Vo-Ve: An Explainable Voice-Vector for Speaker Identity Evaluation

📅 2025-06-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of interpretability in speech representations by proposing Vo-Ve—a speaker-identity-oriented, interpretable speech vector embedding. Methodologically, Vo-Ve explicitly models speaker embeddings as probabilistic distributions over acoustic attributes (e.g., pitch, formants, articulation clarity), departing from conventional black-box feature vectors. It employs a deep neural network to jointly model acoustic features and attribute semantics, enabling end-to-end learning of embeddings that are both discriminative and interpretable. Experiments demonstrate that Vo-Ve achieves performance on par with state-of-the-art speaker embeddings (e.g., x-vector, ECAPA-TDNN) on speaker similarity evaluation, while enabling fine-grained, attribute-level interpretation—such as quantifying that “speaker similarity arises primarily from comparable fundamental frequency and nasality distributions.” This work establishes a novel interpretability paradigm for trustworthy speech recognition and human–machine interaction.
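The core idea above — representing a speaker as a probability vector over explicit voice attributes, then explaining similarity at the attribute level — can be illustrated with a minimal sketch. All names here are hypothetical (the attribute set, the softmax readout, and cosine similarity are illustrative assumptions, not the paper's actual classes or metric):

```python
import numpy as np

# Hypothetical attribute classes; the real Vo-Ve attribute inventory is defined in the paper.
ATTRIBUTES = ["low_pitch", "high_pitch", "nasality", "breathiness", "clear_articulation"]

def attribute_probs(logits):
    """Softmax over attribute logits -> an explainable probability vector (sums to 1)."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def speaker_similarity(p, q):
    """Cosine similarity between two attribute-probability vectors (illustrative metric)."""
    return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))

def explain_similarity(p, q, top_k=2):
    """Return the attributes on which the two speakers agree most closely."""
    diffs = np.abs(p - q)
    return [ATTRIBUTES[i] for i in np.argsort(diffs)[:top_k]]

# Two toy speakers: similar pitch/nasality profiles, differing in breathiness.
p = attribute_probs(np.array([2.0, 0.1, 1.5, 0.3, 0.9]))
q = attribute_probs(np.array([1.8, 0.2, 1.4, 1.0, 0.8]))
print(speaker_similarity(p, q), explain_similarity(p, q))
```

Because every dimension is a named attribute probability rather than an opaque feature, the same vector that scores similarity can also say *why* two voices match, which is the interpretability the paper targets.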

📝 Abstract
In this paper, we propose Vo-Ve, a novel voice-vector embedding that captures speaker identity. Unlike conventional speaker embeddings, Vo-Ve is explainable, as it contains the probabilities of explicit voice attribute classes. Through extensive analysis, we demonstrate that Vo-Ve not only evaluates speaker similarity competitively with conventional techniques but also provides an interpretable explanation in terms of voice attributes. We strongly believe that Vo-Ve can enhance evaluation schemes across various speech tasks due to its high-level explainability.
Problem

Research questions and friction points this paper is trying to address.

Develops explainable voice-vector for speaker identity
Evaluates speaker similarity with interpretable voice attributes
Enhances speech tasks via explainable embedding techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainable voice-vector embedding
Probabilities of voice attributes
Competitive speaker similarity evaluation
Jaejun Lee
Music and Audio Research Group (MARG), Department of Intelligence and Information, Artificial Intelligence Institute, Seoul National University, Republic of Korea
Kyogu Lee
Professor, Seoul National University
Audio Signal Processing · Machine Learning · Computer Audition · Auditory/Music Perception & Cognition