Explainable AI in Speaker Recognition -- Making Latent Representations Understandable

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study shows that the latent representations learned by speaker recognition networks are organized hierarchically rather than as flat, independent clusters, which makes the models easier to interpret. To analyze this structure, the authors apply two hierarchical clustering algorithms, SLINK and HDBSCAN, and design a new matching algorithm, Hierarchical Cluster-Class Matching (HCCM), that performs one-to-one matching between the resulting clusters and predefined semantic classes such as gender and geographic origin. They also propose a metric, Liebig's score, to quantify how well each cluster matches its class. In the experiments, some clusters match a single semantic class (e.g. male, UK), while others match a conjunction of classes (e.g. "male and UK"), and the score is used to diagnose the factor that most strongly limits matching performance.
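To make the idea of a cluster hierarchy concrete, here is a minimal sketch of single-linkage clustering on toy 2-D points that stand in for speaker embeddings. It builds the same kind of single-linkage merge tree that SLINK computes, but with a simpler, slower procedure (sorting all pairwise distances and merging with union-find). The function name `single_linkage` and the toy data are assumptions made for illustration and are not taken from the paper.

```python
from itertools import combinations
import math


def single_linkage(points):
    """Return the merge history as (distance, members_a, members_b) tuples.

    Single linkage always merges the two clusters whose closest pair of
    points is nearest. Processing pairwise distances in ascending order and
    skipping pairs already in the same cluster gives that merge order.
    This is a naive illustration, not the optimised SLINK algorithm.
    """
    n = len(points)
    parent = list(range(n))
    members = {i: frozenset([i]) for i in range(n)}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    merges = []
    for distance, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i == root_j:
            continue  # already in the same cluster
        merges.append((distance, members[root_i], members[root_j]))
        parent[root_j] = root_i
        members[root_i] = members[root_i] | members.pop(root_j)
    return merges


# Toy data (hypothetical): four tight pairs nested inside two larger groups.
points = [(0, 0), (0, 1), (3, 0), (3, 1), (20, 0), (20, 1), (23, 0), (23, 1)]
merges = single_linkage(points)

for distance, left, right in merges:
    print(f"{distance:5.1f}  {sorted(left)} + {sorted(right)}")
```

Reading the merge history from bottom to top gives the hierarchy: the last merge joins the two large groups, and each of those groups was itself formed by merging smaller clusters. A flat method such as K-means would report only one of these levels.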

📝 Abstract

Neural networks can be trained to learn task-relevant representations from data. Understanding how these networks make decisions falls within the Explainable AI (XAI) domain. This paper proposes to study an XAI topic: uncovering unknown organisational patterns in network representations, particularly those representations learned by the speaker recognition network that recognises the speaker identity of utterances. Past studies employed algorithms (e.g. t-distributed Stochastic Neighbour Embedding and K-means) to analyse and visualise how network representations form independent clusters, indicating the presence of flat clustering phenomena within the space defined by these representations. In contrast, this work applies two algorithms -- Single-Linkage Clustering (SLINK) and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) -- to analyse how representations form clusters with hierarchical relationships rather than being independent, thereby demonstrating the existence of hierarchical clustering phenomena within the network representation space. To semantically understand the above hierarchical clustering phenomena, a new algorithm, termed Hierarchical Cluster-Class Matching (HCCM), is designed to perform one-to-one matching between predefined semantic classes and hierarchical representation clusters (i.e. those produced by SLINK or HDBSCAN). Some hierarchical clusters are successfully matched to individual semantic classes (e.g. male, UK), while others to conjunctions of semantic classes (e.g. male and UK, female and Ireland). A new metric, Liebig's score, is proposed to quantify the performance of each matching behaviour, allowing us to diagnose the factor that most strongly limits matching performance.
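The one-to-one matching between hierarchical clusters and semantic classes described in the abstract can be illustrated with a small sketch. This is not the paper's HCCM, and the abstract does not define Liebig's score. As a loudly labelled assumption, the code below uses a greedy matching and a stand-in score equal to the smaller of precision and recall, so that the weaker of the two shows up as the limiting factor. The cluster names, class sets, and the function `match_clusters_to_classes` are all hypothetical.

```python
def match_clusters_to_classes(clusters, classes):
    """Greedy one-to-one matching of clusters to semantic classes.

    ASSUMPTION: the score min(precision, recall) is a stand-in chosen for
    illustration. It is not the paper's Liebig's score, and the greedy
    procedure is not the paper's HCCM algorithm.
    """
    scored = []
    for cluster_name, cluster in clusters.items():
        for class_name, cls in classes.items():
            overlap = len(cluster & cls)
            if overlap == 0:
                continue
            precision = overlap / len(cluster)  # share of the cluster in the class
            recall = overlap / len(cls)         # share of the class in the cluster
            scored.append(
                (min(precision, recall), cluster_name, class_name, precision, recall)
            )

    # Best pairs first; names break ties so the result is deterministic.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    used_clusters, used_classes, matches = set(), set(), {}
    for score, cluster_name, class_name, precision, recall in scored:
        if cluster_name in used_clusters or class_name in used_classes:
            continue  # enforce one-to-one matching
        used_clusters.add(cluster_name)
        used_classes.add(class_name)
        if precision < recall:
            limiting = "precision"
        elif recall < precision:
            limiting = "recall"
        else:
            limiting = "none"
        matches[class_name] = {
            "cluster": cluster_name,
            "score": score,
            "limiting": limiting,
        }
    return matches


# Toy hierarchy over 10 utterances (hypothetical): the root splits into A and B,
# which split again into A1/A2 and B1/B2.
clusters = {
    "root": set(range(10)),
    "A": {0, 1, 2, 3, 4},
    "B": {5, 6, 7, 8, 9},
    "A1": {0, 1, 2},
    "A2": {3, 4},
    "B1": {5, 6, 7},
    "B2": {8, 9},
}

# Toy semantic classes (hypothetical), including two conjunctions.
classes = {
    "male": {0, 1, 2, 3, 4},
    "UK": {0, 1, 2, 5, 6, 7},
    "male and UK": {0, 1, 2},
    "female and Ireland": {8, 9},
}

matches = match_clusters_to_classes(clusters, classes)
for class_name, info in matches.items():
    print(class_name, info)
```

In this toy hierarchy the first split follows gender, so "male" and the two conjunctions each find a cluster that fits exactly, while "UK" has no dedicated cluster and ends up with a weaker match. Reporting which of precision or recall is lower is one simple way to see what limits a given match.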

Problem

Research questions and friction points this paper is trying to address.

Explainable AI

Speaker Recognition

Hierarchical Clustering

Latent Representations

Semantic Interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainable AI

Hierarchical Clustering

Speaker Recognition