Do Bias Benchmarks Generalise? Evidence from Voice-based Evaluation of Gender Bias in SpeechLLMs

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether gender bias benchmarks built around multiple-choice question answering (MCQA) generalise to other task formats, such as long-form generation, and across diverse speech inputs. The authors fine-tune three SpeechLLMs with LoRA adapters to construct models that prefer stereotypical, anti-stereotypical, or neutral/uncertain answers, then test whether these induced behaviours carry over to a second, distinct MCQA benchmark, to long-form creative generation, and across different voices. The results reveal a clear misalignment: behaviour on one MCQA bias benchmark is a poor predictor of behaviour on other MCQA benchmarks and, more importantly, on long-form generation, so single-format evaluation offers limited evidence about real-world fairness. The authors therefore propose an evaluation suite for measuring cross-task behaviour transferability and advocate multi-format bias assessment as a more reliable basis for fairness evaluation of SpeechLLMs.
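As a rough illustration of the fine-tuning setup described above, the following sketch attaches LoRA adapters to a generic speech-capable causal language model with Hugging Face PEFT. The checkpoint path, target modules, and hyperparameters are placeholders, not the paper's actual configuration.

```python
# Minimal LoRA fine-tuning sketch (checkpoint path, target modules, and
# hyperparameters are illustrative placeholders, not the paper's settings).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/speechllm-checkpoint")

lora_cfg = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,                        # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)

# One adapter would be trained per target behaviour: stereotypical,
# anti-stereotypical, or neutral/uncertain answer preference.
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()
# ...train on MCQA examples whose gold answers reflect the target behaviour...
```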

📝 Abstract
Recent work in benchmarking bias and fairness in speech large language models (SpeechLLMs) has relied heavily on multiple-choice question answering (MCQA) formats. The model is tasked with choosing between stereotypical, anti-stereotypical, or neutral/irrelevant answers given an input speech prompt and an optional text prompt. Such MCQA benchmarks implicitly assume that model performance is consistent across other MCQA tasks, voices, and other task formats such as more realistic, long-form evaluations. In this paper, we probe that assumption. We fine-tune three SpeechLLMs using LoRA adapters to induce specific MCQA behaviours: preference for stereotypical, anti-stereotypical, or neutral/uncertain answers. We then evaluate whether these behaviours generalise to another, distinct MCQA benchmark, and more critically to long-form, creative generation tasks. Our results show that performance on MCQA bias benchmarks fails to reliably predict performance across other MCQA benchmarks, and more importantly across long-form tasks. We conclude that current MCQA bias benchmarks show limited evidence of cross-task generalisation in the speech domain, and propose an evaluation suite for measuring behaviour transferability in future models and benchmarks.
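To make the MCQA evaluation concrete, below is a minimal sketch of turning a model's option choices into stereotypical / anti-stereotypical / neutral selection rates; the category labels and scoring convention are assumptions, not the benchmarks' published protocols.

```python
from collections import Counter

def mcqa_bias_rates(choices: list[str]) -> dict[str, float]:
    """Per-category selection rate over a benchmark run.

    choices: for each item, the category of the option the model picked:
    'stereotypical', 'anti_stereotypical', or 'neutral' (assumed labels).
    """
    counts = Counter(choices)
    total = len(choices) or 1  # avoid division by zero on empty input
    return {cat: counts.get(cat, 0) / total
            for cat in ("stereotypical", "anti_stereotypical", "neutral")}

# Example: a model that mostly picks stereotypical options.
print(mcqa_bias_rates(["stereotypical"] * 7 + ["anti_stereotypical"] * 2 + ["neutral"]))
# -> {'stereotypical': 0.7, 'anti_stereotypical': 0.2, 'neutral': 0.1}
```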
Problem

Research questions and friction points this paper is trying to address.

Evaluating cross-task generalization of bias benchmarks in SpeechLLMs
Testing whether MCQA bias assessments predict long-form task behavior (see the sketch after this list)
Assessing behavior transferability between different bias evaluation formats
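The prediction question above can be framed as a rank-correlation check across models: if MCQA bias scores were predictive, models ranked by MCQA bias should rank similarly on long-form bias. A minimal sketch using Spearman's rho from SciPy follows; the per-model scores are made-up numbers, and reducing each task format to a single scalar bias score is itself an assumption.

```python
from scipy.stats import spearmanr

# Hypothetical per-model bias scores (higher = more stereotypical behaviour);
# the numbers are invented for illustration only.
mcqa_scores      = [0.72, 0.41, 0.55, 0.63, 0.38]   # MCQA benchmark
long_form_scores = [0.35, 0.52, 0.39, 0.44, 0.58]   # long-form generation

rho, p_value = spearmanr(mcqa_scores, long_form_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A weak or negative rho indicates poor cross-task predictive validity.
```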
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tuned three SpeechLLMs with LoRA adapters to induce stereotypical, anti-stereotypical, or neutral/uncertain answer preferences
Evaluated whether the induced behaviors generalize to a second, distinct MCQA benchmark
Assessed behavior transfer to long-form, creative generation tasks (sketched below)
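As a crude illustration of how long-form outputs could be scored (the paper's actual long-form scoring procedure is not reproduced here), the sketch below counts gendered terms in generated continuations. The word lists and the term-counting proxy are assumptions; a real evaluation would likely use a richer lexicon or a trained classifier.

```python
import re

# Hypothetical word lists; purely illustrative.
FEMALE_TERMS = {"she", "her", "hers", "woman", "women"}
MALE_TERMS   = {"he", "him", "his", "man", "men"}

def gendered_term_rates(generations: list[str]) -> dict[str, float]:
    """Fraction of tokens in the generated texts that are gendered terms."""
    female = male = total = 0
    for text in generations:
        tokens = re.findall(r"[a-z']+", text.lower())
        total += len(tokens)
        female += sum(t in FEMALE_TERMS for t in tokens)
        male += sum(t in MALE_TERMS for t in tokens)
    total = total or 1  # avoid division by zero on empty input
    return {"female_rate": female / total, "male_rate": male / total}

# Toy continuations from a hypothetical model.
print(gendered_term_rates([
    "The nurse said she would check on him later.",
    "The engineer explained his design to the team.",
]))
```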