🤖 AI Summary
This study addresses the limitations of existing gender bias evaluations in Instruction Text-to-Speech (ITTS) systems, which often rely on univariate tests and fail to capture the combinatorial effects of social cues. The authors propose a multidimensional prompting framework that systematically combines social status, occupational stereotypes, and persona descriptors, revealing for the first time a binding effect among these dimensions in ITTS outputs: bias arises not from any single cue but from a deep coupling between the semantic priors embedded in pretrained text encoders and the distributional properties of the training data. Through analyses of open-source models, semantic probing, and diversity-intervention experiments, the work shows that generic diversity prompts are insufficient to mitigate such entrenched biases. These findings underscore the necessity of compositional analysis for diagnosing latent risks in synthetic speech and establish a critical link between encoder semantic priors and biased voice generation.
📝 Abstract
Current bias evaluations in Instruction Text-to-Speech (ITTS) often rely on univariate testing, overlooking the compositional structure of social cues. In this work, we investigate gender bias by modeling prompts as combinations of Social Status, Career stereotypes, and Persona descriptors. Analyzing open-source ITTS models, we uncover systematic interaction effects where social dimensions modulate one another, creating complex bias patterns missed by univariate baselines. Crucially, our findings indicate that these biases extend beyond surface-level artifacts, demonstrating strong associations with the semantic priors of pre-trained text encoders and the skewed distributions inherent in training data. We further demonstrate that generic diversity prompting is insufficient to override these entrenched patterns, underscoring the need for compositional analysis to diagnose latent risks in generative speech.
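The compositional setup described in the abstract amounts to a full factorial crossing of the three cue dimensions, in contrast to univariate testing that varies one cue at a time. The sketch below illustrates this idea; the dimension values and prompt template are hypothetical placeholders, not the paper's actual descriptor lists.

```python
from itertools import product

# Hypothetical example values for each social-cue dimension
# (the paper's real descriptor lists are not reproduced here).
STATUS = ["high-status", "low-status"]
CAREER = ["nurse", "engineer"]        # stereotypically gendered occupations
PERSONA = ["confident", "gentle"]     # persona/role descriptors

def build_prompts(template="A {status} {career} speaking in a {persona} tone."):
    """Cross all three dimensions so interaction (binding) effects can be
    measured, rather than varying a single cue in isolation."""
    return [
        template.format(status=s, career=c, persona=p)
        for s, c, p in product(STATUS, CAREER, PERSONA)
    ]

prompts = build_prompts()
print(len(prompts))  # 2 * 2 * 2 = 8 compositional prompts
```

Each generated prompt would then be fed to an ITTS model and the perceived gender of the output voice recorded per cell of the factorial design, which is what exposes interaction effects that single-dimension baselines miss.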