Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox

📅 2026-05-26

🤖 AI Summary

This study examines why current audio large language models are susceptible to interference from textual semantics, which often leads them to overlook acoustic cues in paralinguistic understanding tasks such as tone and emotion recognition. To expose this limitation systematically, the authors introduce VoxParadox, an adversarial evaluation benchmark of synthetically generated speech samples in which linguistic content and vocal style are deliberately mismatched. They propose the Prompt-Conditioned Layer Mixer (PCLM), which adaptively fuses multi-layer audio representations based on the input prompt, and pair it with Direct Preference Optimization (DPO) so that the model prefers acoustically supported answers over language-implied ones. The combined approach raises the accuracy of Audio Flamingo 3 on VoxParadox from 17.40% to 65.20% (a gain of 47.80 percentage points) and on the MMSU paralinguistic subset from 37.74% to 54.78% (a gain of 17.04 points).
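To make the evaluation idea concrete, here is a minimal scoring sketch. It assumes each benchmark item carries two labels: the answer supported by the speaking style and the answer implied by the transcript. The `Example` fields, the `score` function, and the toy data are hypothetical and do not come from the VoxParadox release; the sketch only shows how accuracy on the acoustic ground truth and the rate of language-implied answers could be tallied side by side.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # All field names are hypothetical; they are not taken from the benchmark.
    prediction: str       # option chosen by the model
    acoustic_answer: str  # label supported by the speaking style (ground truth)
    language_answer: str  # label implied by the transcript (wrong by design)


def score(examples):
    """Return acoustic accuracy and the rate of language-implied answers."""
    n = len(examples)
    if n == 0:
        raise ValueError("no examples to score")
    acoustic_hits = sum(e.prediction == e.acoustic_answer for e in examples)
    language_hits = sum(e.prediction == e.language_answer for e in examples)
    return {
        "acoustic_accuracy": acoustic_hits / n,
        "language_following_rate": language_hits / n,
    }


# Toy data: the transcript claims happiness while the voice sounds sad.
examples = [
    Example("happy", "sad", "happy"),  # model followed the transcript
    Example("sad", "sad", "happy"),    # model followed the audio
    Example("happy", "sad", "happy"),  # model followed the transcript
    Example("angry", "sad", "happy"),  # model matched neither label
]
result = score(examples)
print(result)
```

Reporting both numbers separates a model that guesses at random from one that systematically reads the transcript instead of listening: the former has a low language-following rate, the latter a high one.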

📝 Abstract

Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark of 2,000 verified examples spanning 10 paralinguistic tasks. The examples are created with controlled speech synthesis that intentionally mismatches transcript claims and speaking style, enabling direct measurement of paralinguistic understanding of speech. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on acoustic ground truth and a strong tendency to follow language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder-LLM interface, and (ii) even when such cues are available in audio tokens, the language model frequently ignores them. To address these problems, we propose Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve Audio LLM paralinguistic understanding, improving Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox, and from 37.74% to 54.78% on the MMSU paralinguistic subset. Our project page is available at https://voxparadox.github.io/.
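The abstract describes PCLM only at a high level, so the following is a rough sketch of one way prompt-conditioned layer mixing could work, not the authors' implementation. It assumes the mixer scores each encoder layer with a dot product between a prompt embedding and a per-layer gate vector, turns the scores into softmax weights, and returns the weighted sum of the per-layer features. The function names, the gate parameterization, and the toy values are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layer_weights(prompt_emb, gate):
    """One weight per encoder layer, conditioned on the prompt.

    gate[l] is a hypothetical learned vector for layer l (an assumption,
    not a detail given in the abstract).
    """
    logits = [sum(p * g for p, g in zip(prompt_emb, row)) for row in gate]
    return softmax(logits)


def mix_layers(layer_feats, weights):
    """Weighted sum over layers; layer_feats[l][t][d] -> mixed[t][d]."""
    num_frames = len(layer_feats[0])
    feat_dim = len(layer_feats[0][0])
    return [
        [
            sum(w * layer[t][d] for w, layer in zip(weights, layer_feats))
            for d in range(feat_dim)
        ]
        for t in range(num_frames)
    ]


# Toy setup: two encoder layers, one audio frame, two feature dimensions.
layer_feats = [[[1.0, 0.0]], [[0.0, 1.0]]]
gate = [[1.0, 0.0], [0.0, 1.0]]

# A prompt aligned with layer 0's gate shifts weight toward layer 0.
w_shallow = layer_weights([2.0, 0.0], gate)
# A neutral (all-zero) prompt weights both layers equally.
w_neutral = layer_weights([0.0, 0.0], gate)
mixed_neutral = mix_layers(layer_feats, w_neutral)
print(w_shallow, w_neutral, mixed_neutral)
```

The point of conditioning on the prompt is that a question about speaking style can draw more on layers where paralinguistic cues are still intact, while a question about content can favor deeper layers. In a real system the gate would be trained jointly with the rest of the model rather than fixed by hand as it is here.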

Problem

Research questions and friction points this paper is trying to address.

Audio LLMs

paralinguistic understanding

speech synthesis

adversarial benchmark

acoustic cues

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio LLMs

paralinguistic understanding

VoxParadox