🤖 AI Summary
This study presents the first systematic evaluation of paralinguistic biases—specifically age, gender, and accent—in end-to-end spoken dialogue models (SDMs) across real-world multi-turn decision-making and recommendation tasks. To quantify bias, we propose two novel metrics: the Group Unfairness Score (GUS) and the Similarity-Normalized Statistical Rate (SNSR), which reveal the persistence of bias under repeated negative feedback and demonstrate that recommendation tasks exacerbate inter-group disparities. Experiments span state-of-the-art models—including Qwen2.5-Omni, GLM-4-Voice, GPT-4o Audio, and Gemini-2.5-Flash—showing that proprietary models exhibit lower overall bias, whereas open-source models are more sensitive to age and gender, with multi-turn interaction amplifying and entrenching unfair outputs. We release FairDialogue, the first benchmark dataset dedicated to fairness in spoken dialogue, alongside open-source evaluation code, advancing fairness research in the audio modality.
📝 Abstract
While biases in large language models (LLMs), such as stereotypes and cultural tendencies in outputs, have been examined and identified, their presence and characteristics in spoken dialogue models (SDMs) with audio input and output remain largely unexplored. Paralinguistic features such as age, gender, and accent can affect model outputs; when compounded over multi-turn conversations, these effects may exacerbate biases, with potential implications for fairness in decision-making and recommendation tasks. In this paper, we systematically evaluate biases in speech LLMs and study the impact of multi-turn dialogues with repeated negative feedback. Bias is measured using the Group Unfairness Score (GUS) for decision tasks and the Similarity-Normalized Statistical Rate (SNSR) for recommendation tasks, across open-source models such as Qwen2.5-Omni and GLM-4-Voice as well as closed-source APIs such as GPT-4o Audio and Gemini-2.5-Flash. Our analysis reveals that closed-source models generally exhibit lower bias, open-source models are more sensitive to age and gender, and recommendation tasks tend to amplify cross-group disparities. We also find that biased decisions can persist across multi-turn conversations. This work provides the first systematic study of biases in end-to-end spoken dialogue models, offering insights toward fair and reliable audio-based interactive systems. To facilitate further research, we release the FairDialogue dataset and evaluation code.
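To make the idea of a group-level decision-bias metric concrete, here is a minimal sketch of one *plausible* way such a score could be computed. This is an illustration only: the paper's exact GUS definition is not given in the abstract, and the function name, grouping scheme, and aggregation (mean absolute deviation of per-group positive-decision rates from the overall mean) are assumptions made for this example.

```python
def group_unfairness(decisions_by_group):
    """Hypothetical GUS-style metric (NOT the paper's exact definition).

    decisions_by_group: dict mapping a paralinguistic group (e.g. an
    accent or age bracket) to a list of binary model decisions (1 =
    favorable outcome, 0 = unfavorable).

    Returns the mean absolute deviation of each group's positive-decision
    rate from the overall mean rate; 0.0 means perfectly uniform treatment,
    larger values mean larger inter-group disparity.
    """
    rates = {g: sum(d) / len(d) for g, d in decisions_by_group.items()}
    mean_rate = sum(rates.values()) / len(rates)
    return sum(abs(r - mean_rate) for r in rates.values()) / len(rates)


# Example: hypothetical loan-approval decisions grouped by speaker accent.
decisions = {
    "accent_A": [1, 1, 0, 1],  # 75% favorable
    "accent_B": [1, 0, 0, 0],  # 25% favorable
}
print(group_unfairness(decisions))  # → 0.25
```

Under this assumed definition, running the same decision prompt with voices from different groups and comparing favorable-outcome rates is what surfaces the disparities the abstract describes; a multi-turn variant would recompute the score after each round of negative feedback to test whether the gap persists.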