🤖 AI Summary
This study presents the first systematic evaluation of paralinguistic biases—specifically age, gender, and accent—in end-to-end spoken dialogue models (SDMs) across real-world multi-turn decision-making and recommendation tasks. To quantify bias, we propose two novel metrics: the Group Unfairness Score (GUS) and the Similarity-Normalized Statistical Rate (SNSR), which reveal the persistence of bias under repeated negative feedback and demonstrate that recommendation tasks exacerbate inter-group disparities. Experiments span state-of-the-art models—including Qwen2.5-Omni, GLM-4-Voice, GPT-4o Audio, and Gemini-2.5-Flash—showing that proprietary models exhibit lower overall bias, whereas open-source models are more sensitive to age and gender, with multi-turn interaction amplifying and entrenching unfair outputs. We release FairDialogue, the first benchmark dataset dedicated to fairness in spoken dialogue, alongside open-source evaluation code, advancing fairness research in the audio modality.
📝 Abstract
While biases in large language models (LLMs), such as stereotypes and cultural tendencies in outputs, have been examined and identified, their presence and characteristics in spoken dialogue models (SDMs) with audio input and output remain largely unexplored. Paralinguistic features such as age, gender, and accent can affect model outputs; when compounded over multi-turn conversations, these effects may exacerbate biases, with potential implications for fairness in decision-making and recommendation tasks. In this paper, we systematically evaluate biases in speech LLMs and study the impact of multi-turn dialogues with repeated negative feedback. Bias is measured using the Group Unfairness Score (GUS) for decision tasks and the Similarity-Normalized Statistical Rate (SNSR) for recommendation tasks, across open-source models such as Qwen2.5-Omni and GLM-4-Voice as well as closed-source APIs such as GPT-4o Audio and Gemini-2.5-Flash. Our analysis reveals that closed-source models generally exhibit lower bias, open-source models are more sensitive to age and gender, and recommendation tasks tend to amplify cross-group disparities. We also find that biased decisions can persist across multi-turn conversations. This work provides the first systematic study of biases in end-to-end spoken dialogue models, offering insights toward fair and reliable audio-based interactive systems. To facilitate further research, we release the FairDialogue dataset and evaluation code.
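To make the idea of a group-level decision-bias metric concrete, here is a minimal sketch of one *plausible* way such a score could be computed. This is an illustration only: the paper's exact GUS definition is not given in the abstract, and the function name, grouping scheme, and aggregation (mean absolute deviation of per-group positive-decision rates from the overall mean) are assumptions made for this example.

```python
def group_unfairness(decisions_by_group):
    """Hypothetical GUS-style metric (NOT the paper's exact definition).

    decisions_by_group: dict mapping a paralinguistic group (e.g. an
    accent or age bracket) to a list of binary model decisions (1 =
    favorable outcome, 0 = unfavorable).

    Returns the mean absolute deviation of each group's positive-decision
    rate from the overall mean rate; 0.0 means perfectly uniform treatment,
    larger values mean larger inter-group disparity.
    """
    rates = {g: sum(d) / len(d) for g, d in decisions_by_group.items()}
    mean_rate = sum(rates.values()) / len(rates)
    return sum(abs(r - mean_rate) for r in rates.values()) / len(rates)


# Example: hypothetical loan-approval decisions grouped by speaker accent.
decisions = {
    "accent_A": [1, 1, 0, 1],  # 75% favorable
    "accent_B": [1, 0, 0, 0],  # 25% favorable
}
print(group_unfairness(decisions))  # → 0.25
```

Under this assumed definition, running the same decision prompt with voices from different groups and comparing favorable-outcome rates is what surfaces the disparities the abstract describes; a multi-turn variant would recompute the score after each round of negative feedback to test whether the gap persists.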