Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding

πŸ“… 2025-11-19
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current large audio-language models (LALMs) are limited by speech encoders that cannot jointly model speaker identity and paralinguistic attributes (e.g., emotion, prosody). To address this, we propose a universal speech encoder trained with a multi-task learning framework that jointly optimizes speaker identification, paralinguistic understanding, and cross-modal alignment, augmented by CLAP pretraining to improve audio–text consistency. A key component is a task-balancing mechanism that mitigates conflicts between objectives: empirical analysis shows that CLAP pretraining alone excels at retrieval but underperforms on paralinguistic modeling, whereas the proposed encoder balances both capabilities. The method achieves state-of-the-art performance on both speech-retrieval and paralinguistic-understanding benchmarks and integrates well with large language models. The code, pretrained models, and training recipes will be released as part of the open-source Auden toolkit.
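The cross-modal alignment objective mentioned above follows the CLAP recipe, which contrasts paired audio and text embeddings in a shared space. A minimal NumPy sketch of that symmetric contrastive (InfoNCE) loss is shown below; the function name, temperature value, and batch layout are illustrative assumptions, not details from the paper:

```python
import numpy as np

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    audio_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so dot products are cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(a))          # matched pairs sit on the diagonal

    def cross_entropy(logits, labels):
        # numerically stable log-softmax over each row
        z = logits - logits.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    # audio-to-text and text-to-audio directions, averaged
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With perfectly aligned pairs the loss approaches zero; shuffling the text rows against the audio rows drives it up, which is the signal that pulls matched audio and captions together during pretraining.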

πŸ“ Abstract
Human voice encodes both identity and paralinguistic cues, yet encoders in large audio-language models (LALMs) rarely balance both aspects. In this work, we present a study toward building a general-purpose voice encoder that captures nuanced voice cues. Through a comprehensive evaluation, we find that multi-task training yields the most balanced representations, whereas contrastive language-audio pretraining (CLAP) primarily improves retrieval without enhancing paralinguistic understanding. Our final encoder, Auden-Voice, also demonstrates strong performance when integrated with LLMs. The code and training recipes will be released with the audio understanding toolkit Auden.
Problem

Research questions and friction points this paper is trying to address.

Balancing voice identity and paralinguistic cues in audio encoders
Building a general-purpose voice encoder for nuanced speech understanding
Evaluating which training methods yield the most balanced voice representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task training yields the most balanced voice representations
Contrastive pretraining (CLAP) improves retrieval without enhancing paralinguistic understanding
The voice encoder performs strongly when integrated with LLMs
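The multi-task training highlighted above must keep the speaker, paralinguistic, and alignment objectives from dominating one another. The paper's actual balancing mechanism is not described on this page; the toy balancer below simply rescales each task loss by its running average so that all objectives contribute at a comparable magnitude. Class and task names are hypothetical:

```python
class BalancedMultiTaskLoss:
    """Toy task balancer: divides each task loss by its exponential moving
    average, so a task with naturally large losses cannot swamp the others.
    Illustrative only; not the mechanism used in the paper."""

    def __init__(self, tasks, momentum=0.9):
        self.avg = {t: None for t in tasks}  # running average per task
        self.momentum = momentum

    def __call__(self, losses):
        """losses: dict mapping task name -> current scalar loss."""
        total = 0.0
        for task, loss in losses.items():
            if self.avg[task] is None:
                self.avg[task] = loss  # initialize on first step
            else:
                self.avg[task] = (self.momentum * self.avg[task]
                                  + (1 - self.momentum) * loss)
            # each task contributes roughly 1.0 regardless of its raw scale
            total += loss / (self.avg[task] + 1e-8)
        return total
```

On the first step each normalized term is close to 1, so a speaker-ID loss of 10.0 and an emotion loss of 0.1 contribute equally; afterwards the moving average adapts slowly, damping sudden spikes in any single objective.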
πŸ”Ž Similar Papers
No similar papers found.