Multi-Axis Speech Similarity via Factor-Partitioned Embeddings

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional monolithic speech embeddings conflate multiple attributes, such as linguistic content, speaker identity, dialect, and gender, which hinders fine-grained similarity measurement. This work proposes a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces each correspond to a distinct axis of variation. A shared acoustic encoder feeds one linear projection head per axis, and each head is trained either by distillation from a specialist teacher or with a contrastive objective over pairs that share a label. At retrieval time, similarity is a signed weighted sum of per-axis cosine scores, so individual attributes can be amplified or suppressed. The authors describe this as, to their knowledge, the first approach to achieve explicit multi-attribute disentanglement within a unified embedding space. On cross-corpus retrieval over corpora that share the Harvard sentence prompts, they report that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions.
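The layout described above (one shared encoder output, one linear head per axis, outputs concatenated into a single vector) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the axis names, subspace sizes, encoder width, and the random matrices standing in for trained heads are all assumptions.

```python
import numpy as np

# Hypothetical axis names and subspace sizes; the source gives no dimensions.
AXIS_DIMS = {"content": 8, "speaker": 4, "dialect": 2, "gender": 2}
ENCODER_DIM = 16  # assumed width of the shared encoder output

rng = np.random.default_rng(0)
# One linear projection head per axis. Random weights stand in for trained ones.
HEADS = {axis: rng.standard_normal((dim, ENCODER_DIM))
         for axis, dim in AXIS_DIMS.items()}


def axis_slices(axis_dims):
    """Map each axis name to the slice it occupies in the concatenated vector."""
    slices, start = {}, 0
    for axis, dim in axis_dims.items():
        slices[axis] = slice(start, start + dim)
        start += dim
    return slices


def partitioned_embedding(encoder_output):
    """Apply every per-axis head to the shared encoder output and concatenate."""
    parts = [HEADS[axis] @ encoder_output for axis in AXIS_DIMS]
    return np.concatenate(parts)


SLICES = axis_slices(AXIS_DIMS)
h = rng.standard_normal(ENCODER_DIM)  # stand-in for the shared encoder's output
z = partitioned_embedding(h)          # single vector, partitioned by axis
```

Reading `z[SLICES["speaker"]]` recovers only the speaker subspace, which is what makes per-axis comparison possible on a single stored vector.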

📝 Abstract

Speech encodes multiple simultaneous attributes (linguistic content, speaker identity, dialect, gender) that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs. The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how it was said, or that explicitly suppresses one attribute to surface another. We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions.
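The signed weighted sum over per-axis cosine scores can be illustrated with a small sketch. The two-axis layout, the toy vectors, and the weight values are assumptions chosen to make the effect visible; the abstract does not specify any of them.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def multi_axis_similarity(a, b, slices, weights):
    """Signed weighted sum of cosine scores, one score per axis subspace."""
    return sum(w * cosine(a[slices[axis]], b[slices[axis]])
               for axis, w in weights.items())


# Hypothetical layout: dims 0-1 hold the content axis, dims 2-3 the speaker axis.
slices = {"content": slice(0, 2), "speaker": slice(2, 4)}

query = np.array([1.0, 0.0, 1.0, 0.0])
same_speaker_other_text = np.array([0.0, 1.0, 1.0, 0.0])
other_speaker_same_text = np.array([1.0, 0.0, 0.0, 1.0])

# A negative speaker weight penalizes same-speaker matches.
weights = {"content": 1.0, "speaker": -0.5}

score_same_speaker = multi_axis_similarity(
    query, same_speaker_other_text, slices, weights)
score_same_text = multi_axis_similarity(
    query, other_speaker_same_text, slices, weights)
```

With these toy values the same-speaker candidate scores -0.5 and the same-text candidate scores 1.0, so the utterance with matching content from a different speaker ranks first. Setting both weights to 1.0 would instead tie the two candidates.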

Problem

Research questions and friction points this paper is trying to address.

speech embeddings

attribute disentanglement

multi-axis similarity

speaker bias

cross-corpus retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

factor-partitioned embeddings

multi-axis speech similarity

attribute disentanglement