🤖 AI Summary
This work identifies significant interaction biases between gender and phonation type (e.g., breathy, creaky) in speech foundation models during speech continuation, a novel task requiring coherent extension of a single audio stream. The task explicitly probes speaker-similarity preservation, voice-quality fidelity, and text-level bias. Systematic evaluation reveals that female prompts disproportionately trigger regression toward modal phonation and exacerbate text-level gender bias. All evaluated models, SpiritLM (base and expressive), VAE-GSLM, and SpeechGPT, exhibit voice-quality bias against female voices; notably, VAE-GSLM, the model that achieves sufficient continuation coherence, manifests the strongest text-level bias. This study establishes the first dedicated benchmark for fairness assessment in speech continuation, introducing an empirically grounded paradigm for evaluating demographic and phonatory fairness in speech foundation models.
📝 Abstract
Speech Continuation (SC) is the task of generating a coherent extension of a spoken prompt while preserving both semantic context and speaker identity. Because SC is constrained to a single audio stream, it offers a more direct setting for probing biases in speech foundation models than dialogue does. In this work we present the first systematic evaluation of bias in SC, investigating how gender and phonation type (breathy, creaky, end-creak) affect continuation behaviour. We evaluate three recent models, SpiritLM (base and expressive), VAE-GSLM, and SpeechGPT, across speaker similarity, voice-quality preservation, and text-based bias metrics. Results show that while both speaker similarity and coherence remain a challenge, textual evaluations reveal significant model-gender interactions: once coherence is sufficiently high (as for VAE-GSLM), gender effects emerge on text metrics such as agency and sentence polarity. In addition, continuations revert toward modal phonation more strongly for female prompts than for male ones, revealing a systematic voice-quality bias. These findings highlight SC as a controlled probe of socially relevant representational biases in speech foundation models, and suggest that it will become an increasingly informative diagnostic as continuation quality improves.