AI Summary
Traditional speaker recognition systems are limited to classification or embedding extraction, failing to generate structured, context-rich speaker profiles, such as dialect, gender, and age. This work proposes the first descriptive Speaker Language Model (SLM), introducing a collaborative architecture comprising a speaker encoder and a prompt-driven decoder to enable a paradigm shift from raw speech to natural-language speaker profiling. Our method integrates a contrastive-learning-based encoder, an LLM-adaptation interface, and editable prompt templates, supporting zero-shot cross-domain generation via instruction fine-tuning and embedding-conditioned prompting. Evaluated on multi-source datasets, the SLM achieves 82.4% zero-shot descriptive accuracy and improves F1-score by 37.6% over strong baselines. To our knowledge, this is the first approach enabling fine-grained, customizable generation of speaker attributes directly from speech.
Abstract
Speaker recognition systems are often limited to classification tasks and struggle to generate detailed speaker characteristics or provide context-rich descriptions. These models primarily extract embeddings for speaker identification but fail to capture demographic attributes such as dialect, gender, and age in a structured manner. This paper introduces CoLMbo, a Speaker Language Model (SLM) that addresses these limitations by integrating a speaker encoder with prompt-based conditioning. This allows for the creation of detailed captions based on speaker embeddings. CoLMbo utilizes user-defined prompts to adapt dynamically to new speaker characteristics and provides customized descriptions, including regional dialect variations and age-related traits. This innovative approach not only enhances traditional speaker profiling but also excels in zero-shot scenarios across diverse datasets, marking a significant advancement in the field of speaker recognition.
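The embedding-conditioned prompting described above can be sketched minimally: a speaker embedding from the encoder is projected into the decoder's hidden space as a short sequence of "prefix" vectors and prepended to the embedded prompt template, so the decoder generates a description conditioned on both. This is an illustrative sketch, not CoLMbo's actual implementation; all dimensions (`SPK_DIM`, `LLM_DIM`, `N_PREFIX`) and the single linear projection are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not taken from the paper).
SPK_DIM = 192    # speaker-encoder embedding dimension
LLM_DIM = 768    # decoder hidden dimension
N_PREFIX = 4     # how many prefix tokens the embedding maps to

# Learned projection adapting speaker space to the decoder's token space
# (here random weights stand in for trained parameters).
W = rng.standard_normal((SPK_DIM, N_PREFIX * LLM_DIM)) * 0.02

def embed_to_prefix(spk_emb: np.ndarray) -> np.ndarray:
    """Map one speaker embedding to a sequence of prefix vectors."""
    return (spk_emb @ W).reshape(N_PREFIX, LLM_DIM)

def condition_prompt(spk_emb: np.ndarray, prompt_embs: np.ndarray) -> np.ndarray:
    """Prepend speaker-derived prefix tokens to the embedded prompt template."""
    return np.concatenate([embed_to_prefix(spk_emb), prompt_embs], axis=0)

spk_emb = rng.standard_normal(SPK_DIM)        # e.g. output of a contrastive encoder
prompt = rng.standard_normal((10, LLM_DIM))   # embedded user-defined prompt tokens
seq = condition_prompt(spk_emb, prompt)
print(seq.shape)  # (14, 768)
```

Editing the prompt template changes the requested attributes (dialect, age, gender) without retraining the projection, which is what makes the conditioning user-customizable.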