DAME: Duration-Aware Matryoshka Embedding for Duration-Robust Speaker Verification

📅 2026-01-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the performance limitations of short-duration speaker verification, which arise from limited discriminative cues and from the mismatch between fixed-dimensional embeddings and the varying information content of utterances of different durations. To this end, the authors propose a model-agnostic, duration-aware Matryoshka embedding framework that constructs nested sub-embeddings aligned with utterance length: low-dimensional representations serve short utterances, while high-dimensional ones preserve fine-grained details in longer speech. The approach supports both training from scratch and fine-tuning, adds no inference overhead, and is further enhanced with a large-margin fine-tuning strategy. Experiments on VoxCeleb1-O/E/H and VOiCES demonstrate significant reductions in equal error rate for utterances of one second or shorter while maintaining competitive performance on full-length utterances, with strong generalization across diverse encoder architectures.

๐Ÿ“ Abstract
Short-utterance speaker verification remains challenging due to limited speaker-discriminative cues in short speech segments. While existing methods focus on enhancing speaker encoders, the embedding learning strategy still forces a single fixed-dimensional representation reused for utterances of any length, leaving capacity misaligned with the information available at different durations. We propose Duration-Aware Matryoshka Embedding (DAME), a model-agnostic framework that builds a nested hierarchy of sub-embeddings aligned to utterance durations: lower-dimensional representations capture compact speaker traits from short utterances, while higher dimensions encode richer details from longer speech. DAME supports both training from scratch and fine-tuning, and serves as a direct alternative to conventional large-margin fine-tuning, consistently improving performance across durations. On the VoxCeleb1-O/E/H and VOiCES evaluation sets, DAME consistently reduces the equal error rate on 1-s and other short-duration trials, while maintaining full-length performance with no additional inference cost. These gains generalize across various speaker encoder architectures under both general training and fine-tuning setups.
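The duration-aligned nested embedding idea described above can be sketched in a few lines. Note that the dimension ladder, the duration-to-dimension mapping, and the prefix-based scoring below are illustrative assumptions for exposition, not the paper's actual configuration or training objective.

```python
import numpy as np

# Assumed nested dimension ladder: shorter utterances are scored with
# smaller prefixes of the full embedding (values are illustrative).
NESTED_DIMS = [64, 128, 256, 512]

def pick_dim(duration_s: float) -> int:
    """Map utterance duration to a nested sub-embedding size (assumed mapping)."""
    if duration_s <= 1.0:
        return NESTED_DIMS[0]
    if duration_s <= 2.0:
        return NESTED_DIMS[1]
    if duration_s <= 5.0:
        return NESTED_DIMS[2]
    return NESTED_DIMS[3]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_trial(emb_a: np.ndarray, dur_a: float,
                emb_b: np.ndarray, dur_b: float) -> float:
    """Score a verification trial using the embedding prefix matched to
    the shorter of the two utterances; no extra inference cost, since the
    sub-embedding is just a slice of the full one."""
    d = pick_dim(min(dur_a, dur_b))
    return cosine(emb_a[:d], emb_b[:d])

rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=512), rng.normal(size=512)
s = score_trial(e1, 0.8, e2, 10.0)  # a 1-s trial falls back to the 64-dim prefix
```

Because each sub-embedding is a prefix of the full vector, a single forward pass yields representations for every duration regime, which is what allows the framework to remain model-agnostic and inference-cost free.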
Problem

Research questions and friction points this paper is trying to address.

short-utterance speaker verification
duration-robust
embedding capacity
speaker-discriminative cues
fixed-dimensional representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Duration-Aware
Matryoshka Embedding
Speaker Verification
Short-Utterance
Nested Embedding
👥 Authors
Youngmoon Jung
Samsung Research
Deep learning, speaker recognition, voice activity detection, speech enhancement, speech synthesis
Joon-Young Yang
AI Solution Team, Samsung Research, Seoul, South Korea
Ju-ho Kim
Samsung Research
Speaker recognition, deep learning
Jaeyoung Roh
AI Solution Team, Samsung Research, Seoul, South Korea
Chang Woo Han
Samsung Research, Samsung Electronics
Speech recognition, signal processing
Hoon-Young Cho
AI Solution Team, Samsung Research, Seoul, South Korea