AI Summary
Current large audio-language models (LALMs) suffer from an inability of their speech encoders to jointly model speaker identity and paralinguistic attributes (e.g., emotion, prosody). To address this, we propose a universal speech encoder trained with a multi-task learning framework that jointly optimizes speaker identification, paralinguistic understanding, and cross-modal alignment, augmented by CLAP pretraining to strengthen audio-text consistency. A key innovation is a task-balancing mechanism that mitigates conflicts between objectives: empirical analysis reveals that CLAP excels at retrieval but underperforms on paralinguistic modeling, whereas our encoder substantially improves the synergy between the two capabilities. Our method achieves state-of-the-art performance on both speech retrieval and paralinguistic understanding benchmarks, and it integrates well with large language models. The code, pretrained models, and training recipes will be released as part of the open-source Auden toolkit.
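The summary above describes combining several objectives (speaker identification, paralinguistic understanding, cross-modal alignment) under a task-balancing mechanism. The paper does not specify the mechanism here, so the sketch below is only an illustration of one common approach, uncertainty-based loss weighting; the task names and weighting scheme are assumptions, not the authors' actual method.

```python
import math

def balanced_multitask_loss(losses, log_vars):
    """Combine per-task losses with uncertainty-style balancing weights.

    losses:   dict mapping task name -> scalar loss value
    log_vars: dict mapping task name -> log-variance (learnable in practice)

    NOTE: illustrative only; the task names and this weighting scheme are
    hypothetical, not the mechanism described in the paper.
    """
    total = 0.0
    for task, loss in losses.items():
        s = log_vars[task]
        # Down-weight each task by its (exponentiated) uncertainty, with a
        # regularizer (+ s) that keeps the uncertainty from growing unbounded.
        total += math.exp(-s) * loss + s
    return total

# Hypothetical per-task losses for one training step.
losses = {"speaker_id": 1.2, "paralinguistic": 0.8, "clap_alignment": 0.5}
log_vars = {"speaker_id": 0.0, "paralinguistic": 0.0, "clap_alignment": 0.0}
print(round(balanced_multitask_loss(losses, log_vars), 2))  # -> 2.5
```

With all log-variances at zero the combined loss reduces to the plain sum of the task losses; as a task's log-variance grows, its contribution is down-weighted, which is one way conflicting objectives can be traded off during joint training.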
Abstract
The human voice encodes both identity and paralinguistic cues, yet the encoders used in large audio-language models (LALMs) rarely balance both aspects. In this work, we present a study toward building a general-purpose voice encoder that captures nuanced voice cues. Through a comprehensive evaluation, we find that multi-task training yields the most balanced representations, whereas contrastive language-audio pretraining (CLAP) primarily improves retrieval without enhancing paralinguistic understanding. Our final encoder, Auden-Voice, also demonstrates strong performance when integrated with LLMs. The code and training recipes will be released with the audio understanding toolkit Auden.