Learning Multiple Utterance-Level Attribute Representations with a Unified Speech Encoder

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a key limitation of existing speech foundation models, which typically produce frame-level representations that are ill-suited for multimodal or multilingual tasks requiring utterance-level attributes such as semantics or speaker identity. To overcome this, the authors propose a unified post-training framework that builds upon a self-supervised speech encoder and employs supervised fine-tuning to jointly learn multiple types of utterance-level embeddings—including semantic and speaker characteristics—within a single encoder. This approach achieves, for the first time, unified modeling of diverse utterance-level properties using one shared architecture. Experimental results demonstrate significant improvements in both semantic alignment for cross-lingual speech retrieval and speaker identification accuracy, thereby enhancing the generalizability and representational capacity of speech foundation models in multitask settings.

📝 Abstract
Speech foundation models trained with self-supervised learning produce generic speech representations that support a wide range of speech processing tasks. When further adapted with supervised learning, these models can achieve strong performance on specific downstream tasks. Recent post-training approaches, such as SAMU-XLSR and SONAR, align speech representations with utterance-level semantic representations, enabling effective multimodal (speech-text) and multilingual applications. While speech foundation models typically learn contextual embeddings at the acoustic frame level, these methods learn representations at the utterance level. In this work, we extend this paradigm to arbitrary utterance-level attributes and propose a unified post-training framework that enables a single speech foundation model to generate multiple types of utterance-level representations. We demonstrate the effectiveness of this approach by jointly learning semantic and speaker representations and evaluating them on multilingual speech retrieval and speaker recognition tasks.
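The abstract describes one shared encoder producing several utterance-level embeddings, one per attribute (semantic, speaker). A minimal NumPy sketch of this idea is shown below: frame-level features from a (stand-in) shared encoder are pooled by a per-attribute attention head and projected into that attribute's unit-normalized embedding space. All names, shapes, and the attention-pooling choice are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T frames of D-dim frame-level features from a shared
# encoder; E is the utterance-level embedding dimension.
T, D, E = 50, 768, 256

frame_feats = rng.standard_normal((T, D))  # stand-in for encoder output

def make_head(d_in, d_out, rng):
    # One attention-pooling + projection head per utterance-level attribute.
    return {
        "attn_w": rng.standard_normal(d_in) / np.sqrt(d_in),   # frame scoring vector
        "proj": rng.standard_normal((d_in, d_out)) / np.sqrt(d_in),
    }

def utterance_embedding(frames, head):
    # Softmax attention weights over frames, then weighted mean pooling.
    scores = frames @ head["attn_w"]               # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    pooled = weights @ frames                      # (D,) attribute-specific pooling
    emb = pooled @ head["proj"]                    # (E,) projection
    return emb / np.linalg.norm(emb)               # unit norm, as used for cosine losses

# Two heads sharing the same frame-level features: one per attribute.
heads = {name: make_head(D, E, rng) for name in ("semantic", "speaker")}
embs = {name: utterance_embedding(frame_feats, h) for name, h in heads.items()}

for name, e in embs.items():
    print(name, e.shape)
```

In the paper's supervised fine-tuning setting, each head's output would be trained against its own target (e.g. text-derived semantic embeddings for the semantic head, speaker labels for the speaker head) while the encoder underneath stays shared; the sketch only shows the shared-encoder, multi-head forward pass.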
Problem

Research questions and friction points this paper is trying to address.

utterance-level representation
speech foundation model
multimodal alignment
speaker representation
semantic representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified speech encoder
utterance-level attribute representations
multimodal alignment
post-training framework
speech foundation model