AI Summary
Current large audio-language models (LALMs) suffer from an inability of their speech encoders to jointly model speaker identity and paralinguistic attributes (e.g., emotion, prosody). To address this, we propose a universal speech encoder trained with a multi-task learning framework that jointly optimizes speaker identification, paralinguistic understanding, and cross-modal alignment, augmented by CLAP pretraining to strengthen audio-text consistency. A key innovation is a task-balancing mechanism that mitigates conflicts between objectives: empirical analysis reveals that CLAP excels at retrieval but underperforms on paralinguistic modeling, whereas our encoder substantially improves the synergy between the two capabilities. Our method achieves state-of-the-art performance on both speech retrieval and paralinguistic understanding benchmarks, and it integrates well with large language models. The code, pretrained models, and training recipes will be released as part of the open-source Auden toolkit.
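The summary above describes combining several objectives (speaker identification, paralinguistic understanding, cross-modal alignment) under a task-balancing mechanism. The paper does not specify the mechanism here, so the sketch below is only an illustration of one common approach, uncertainty-based loss weighting; the task names and weighting scheme are assumptions, not the authors' actual method.

```python
import math

def balanced_multitask_loss(losses, log_vars):
    """Combine per-task losses with uncertainty-style balancing weights.

    losses:   dict mapping task name -> scalar loss value
    log_vars: dict mapping task name -> log-variance (learnable in practice)

    NOTE: illustrative only; the task names and this weighting scheme are
    hypothetical, not the mechanism described in the paper.
    """
    total = 0.0
    for task, loss in losses.items():
        s = log_vars[task]
        # Down-weight each task by its (exponentiated) uncertainty, with a
        # regularizer (+ s) that keeps the uncertainty from growing unbounded.
        total += math.exp(-s) * loss + s
    return total

# Hypothetical per-task losses for one training step.
losses = {"speaker_id": 1.2, "paralinguistic": 0.8, "clap_alignment": 0.5}
log_vars = {"speaker_id": 0.0, "paralinguistic": 0.0, "clap_alignment": 0.0}
print(round(balanced_multitask_loss(losses, log_vars), 2))  # -> 2.5
```

With all log-variances at zero the combined loss reduces to the plain sum of the task losses; as a task's log-variance grows, its contribution is down-weighted, which is one way conflicting objectives can be traded off during joint training.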
Abstract
The human voice encodes both identity and paralinguistic cues, yet the encoders used in large audio-language models (LALMs) rarely balance both aspects. In this work, we present a study toward building a general-purpose voice encoder that captures nuanced voice cues. Through a comprehensive evaluation, we find that multi-task training yields the most balanced representations, whereas contrastive language-audio pretraining (CLAP) primarily improves retrieval without enhancing paralinguistic understanding. Our final encoder, Auden-Voice, also demonstrates strong performance when integrated with LLMs. The code and training recipes will be released with the audio understanding toolkit Auden.