DAME: Duration-Aware Matryoshka Embedding for Duration-Robust Speaker Verification

📅 2026-01-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the performance limitations of short-duration speaker verification, which arise from limited discriminative cues and from the mismatch between fixed-dimensional embeddings and the varying information content of utterances of different durations. To this end, the authors propose a model-agnostic, duration-aware Matryoshka embedding framework that constructs nested sub-embeddings aligned with utterance length: low-dimensional representations serve short utterances, while high-dimensional ones preserve fine-grained details in longer speech. The approach supports both training from scratch and fine-tuning, adds no inference overhead, and is further enhanced with a large-margin fine-tuning strategy. Experiments on VoxCeleb1-O/E/H and VOiCES demonstrate significant reductions in equal error rate for utterances of one second or shorter while maintaining competitive performance on full-length utterances, with strong generalization across diverse encoder architectures.

๐Ÿ“ Abstract
Short-utterance speaker verification remains challenging due to limited speaker-discriminative cues in short speech segments. While existing methods focus on enhancing speaker encoders, the embedding learning strategy still forces a single fixed-dimensional representation reused for utterances of any length, leaving capacity misaligned with the information available at different durations. We propose Duration-Aware Matryoshka Embedding (DAME), a model-agnostic framework that builds a nested hierarchy of sub-embeddings aligned to utterance durations: lower-dimensional representations capture compact speaker traits from short utterances, while higher dimensions encode richer details from longer speech. DAME supports both training from scratch and fine-tuning, and serves as a direct alternative to conventional large-margin fine-tuning, consistently improving performance across durations. On the VoxCeleb1-O/E/H and VOiCES evaluation sets, DAME consistently reduces the equal error rate on 1-s and other short-duration trials, while maintaining full-length performance with no additional inference cost. These gains generalize across various speaker encoder architectures under both general training and fine-tuning setups.
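The duration-aligned nested embedding idea described above can be sketched in a few lines. Note that the dimension ladder, the duration-to-dimension mapping, and the prefix-based scoring below are illustrative assumptions for exposition, not the paper's actual configuration or training objective.

```python
import numpy as np

# Assumed nested dimension ladder: shorter utterances are scored with
# smaller prefixes of the full embedding (values are illustrative).
NESTED_DIMS = [64, 128, 256, 512]

def pick_dim(duration_s: float) -> int:
    """Map utterance duration to a nested sub-embedding size (assumed mapping)."""
    if duration_s <= 1.0:
        return NESTED_DIMS[0]
    if duration_s <= 2.0:
        return NESTED_DIMS[1]
    if duration_s <= 5.0:
        return NESTED_DIMS[2]
    return NESTED_DIMS[3]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_trial(emb_a: np.ndarray, dur_a: float,
                emb_b: np.ndarray, dur_b: float) -> float:
    """Score a verification trial using the embedding prefix matched to
    the shorter of the two utterances; no extra inference cost, since the
    sub-embedding is just a slice of the full one."""
    d = pick_dim(min(dur_a, dur_b))
    return cosine(emb_a[:d], emb_b[:d])

rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=512), rng.normal(size=512)
s = score_trial(e1, 0.8, e2, 10.0)  # a 1-s trial falls back to the 64-dim prefix
```

Because each sub-embedding is a prefix of the full vector, a single forward pass yields representations for every duration regime, which is what allows the framework to remain model-agnostic and inference-cost free.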
Problem

Research questions and friction points this paper is trying to address.

short-utterance speaker verification
duration-robust
embedding capacity
speaker-discriminative cues
fixed-dimensional representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Duration-Aware
Matryoshka Embedding
Speaker Verification
Short-Utterance
Nested Embedding
👥 Authors
Youngmoon Jung
Samsung Research
Deep learning, speaker recognition, voice activity detection, speech enhancement, speech synthesis
Joon-Young Yang
AI Solution Team, Samsung Research, Seoul, South Korea
Ju-ho Kim
Samsung Research
Speaker recognition, deep learning
Jaeyoung Roh
AI Solution Team, Samsung Research, Seoul, South Korea
Chang Woo Han
Samsung Research, Samsung Electronics
Speech recognition, signal processing
Hoon-Young Cho
AI Solution Team, Samsung Research, Seoul, South Korea