🤖 AI Summary
This work addresses the lack of interpretability in speech representations by proposing Vo-Ve, a speaker-identity-oriented, interpretable voice-vector embedding. Methodologically, Vo-Ve models speaker embeddings as probability distributions over explicit voice-attribute classes (e.g., pitch, formants, articulation clarity), departing from conventional black-box feature vectors. A deep neural network jointly models acoustic features and attribute semantics, enabling end-to-end learning of embeddings that are both discriminative and interpretable. Experiments demonstrate that Vo-Ve performs on par with state-of-the-art speaker embeddings (e.g., x-vector, ECAPA-TDNN) on speaker-similarity evaluation while also providing fine-grained, attribute-level explanations, such as attributing a high similarity score primarily to comparable fundamental-frequency and nasality distributions. The authors argue that this explainability can strengthen evaluation schemes across a range of speech tasks.
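To make the idea concrete, here is a minimal sketch of how attribute-probability embeddings yield an interpretable similarity score. The attribute names, the toy probability vectors, and the use of cosine similarity with per-attribute dot-product shares are illustrative assumptions for this sketch, not the paper's actual attribute inventory or scoring method.

```python
import math

# Hypothetical voice-attribute classes; the paper's real inventory is not
# specified in this summary, so these names are placeholders.
ATTRIBUTES = ["low_pitch", "high_pitch", "nasal", "breathy", "clear_articulation"]

def cosine_similarity(a, b):
    """Standard cosine similarity between two attribute-probability vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def attribute_contributions(a, b):
    """Share of the raw dot product contributed by each attribute: a crude
    proxy for which attributes drive the similarity score."""
    prods = [x * y for x, y in zip(a, b)]
    total = sum(prods)
    return {name: p / total for name, p in zip(ATTRIBUTES, prods)}

# Toy Vo-Ve-style embeddings: probabilities over attribute classes (sum to 1).
spk_a = [0.50, 0.05, 0.30, 0.05, 0.10]
spk_b = [0.45, 0.10, 0.25, 0.10, 0.10]

sim = cosine_similarity(spk_a, spk_b)
contrib = attribute_contributions(spk_a, spk_b)
top_attribute = max(contrib, key=contrib.get)
```

Because every dimension is a named attribute probability, the similarity score can be decomposed attribute by attribute, which is the kind of explanation (e.g., "these speakers sound alike mainly in pitch") that opaque embeddings cannot provide.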
📝 Abstract
In this paper, we propose Vo-Ve, a novel voice-vector embedding that captures speaker identity. Unlike conventional speaker embeddings, Vo-Ve is explainable, as it contains the probabilities of explicit voice attribute classes. Through extensive analysis, we demonstrate that Vo-Ve not only evaluates speaker similarity competitively with conventional techniques but also provides an interpretable explanation in terms of voice attributes. We strongly believe that Vo-Ve can enhance evaluation schemes across various speech tasks due to its high-level explainability.