🤖 AI Summary
Conventional single-vector speech embeddings conflate multiple attributes, such as linguistic content, speaker identity, dialect, and gender, which hinders fine-grained similarity measurement. This work proposes a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation. A shared acoustic encoder feeds one linear projection head per axis, and each head is trained either by distillation from a specialist teacher or with a contrastive objective over shared-label pairs. At retrieval time, similarity is computed as a signed weighted sum of per-axis cosine scores, so that targeted attributes can be amplified or suppressed. To the best of our knowledge, this is the first approach to achieve explicit multi-attribute disentanglement within a unified embedding space. Experiments on cross-corpus retrieval report improved recall of semantically matched utterances and reduced same-speaker bias, supporting the use of multi-axis controllable similarity metrics.
📝 Abstract
Speech encodes multiple simultaneous attributes (linguistic content, speaker identity, dialect, gender) that conventional single-vector embeddings conflate.
We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces correspond to distinct axes of variation.
A shared acoustic encoder feeds per-axis linear projection heads, each trained via distillation from a specialist teacher or a contrastive objective over shared-label pairs.
The resulting embeddings support attribute-conditioned retrieval: similarity is computed as a signed weighted sum over per-axis cosine scores, allowing retrieval that jointly considers what was said and how it was said, or that explicitly suppresses one attribute to surface another.
We evaluate on cross-corpus retrieval over corpora sharing the Harvard sentence prompts, demonstrating that signed axis weighting can suppress same-speaker bias and surface semantically matched utterances across recording conditions.
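The scoring scheme described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the axis names, the 4-dimensional subspaces, the random linear heads, and the toy vectors are all assumptions chosen for illustration. It shows per-axis linear heads concatenated into one vector, per-axis cosine scores over the subspaces, and a signed weighted sum in which a negative speaker weight demotes a same-speaker candidate.

```python
import numpy as np

# Hypothetical layout: axis name -> slice of the single embedding vector.
# Slices are listed in order, so concatenating heads in this order matches them.
AXES = {"content": slice(0, 4), "speaker": slice(4, 8)}

def project(hidden, heads):
    """Apply one linear head per axis to a shared encoder output, then concatenate."""
    return np.concatenate([hidden @ heads[name] for name in AXES])

def cosine(a, b, eps=1e-12):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def axis_scores(query, candidate):
    """Cosine score computed separately inside each axis subspace."""
    return {name: cosine(query[s], candidate[s]) for name, s in AXES.items()}

def signed_similarity(query, candidate, weights):
    """Signed weighted sum of per-axis cosine scores; negative weights suppress an axis."""
    scores = axis_scores(query, candidate)
    return sum(weights.get(name, 0.0) * scores[name] for name in AXES)

# Stand-in for the shared encoder output and the per-axis heads (random, untrained).
rng = np.random.default_rng(0)
hidden = rng.standard_normal(6)
heads = {name: rng.standard_normal((6, 4)) for name in AXES}
embedding = project(hidden, heads)  # one vector, two subspaces

# Toy embeddings built by hand so the per-axis scores are easy to read.
sentence_a = np.array([1.0, 0.0, 0.0, 0.0])
sentence_b = np.array([0.6, 0.8, 0.0, 0.0])   # partially similar content
speaker_1 = np.array([1.0, 0.0, 0.0, 0.0])
speaker_2 = np.array([0.0, 1.0, 0.0, 0.0])

query = np.concatenate([sentence_a, speaker_1])
same_speaker = np.concatenate([sentence_b, speaker_1])   # wrong sentence, same voice
same_sentence = np.concatenate([sentence_a, speaker_2])  # right sentence, other voice

uniform = {"content": 1.0, "speaker": 1.0}
suppress = {"content": 1.0, "speaker": -1.0}

for label, w in [("uniform", uniform), ("suppress speaker", suppress)]:
    print(label,
          round(signed_similarity(query, same_speaker, w), 3),
          round(signed_similarity(query, same_sentence, w), 3))
```

With uniform weights the same-speaker candidate outranks the candidate that actually shares the sentence; flipping the sign of the speaker weight reverses that ordering in this toy case. How the weights are chosen in practice is not specified in the abstract.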